Preface



This volume contains the proceedings of AI 2001, the 14th Australian Joint Conference 
on Artificial Intelligence. The aim of this conference series is to support the Australian 
artificial intelligence community with a forum for discussion and presentation. As before,
this conference not only brought together a great deal of Australian AI research, but also 
attracted widespread international interest. 

The conference this year saw an impressive array of about 110 submitted papers from no
fewer than 16 countries. Full-length versions of all submitted papers were refereed by the 
international program committee. As a result, these proceedings contain 55 papers not 
just from Australia, but also Canada, France, Germany, The Netherlands, Japan, Korea, 
New Zealand, and the UK and USA. 

The conference also comprised a tutorial program and several workshops, and featured 
five invited speakers on theoretical, philosophical, and applied topics: Didier Dubois of 
the Université Paul Sabatier, James Hendler of the University of Maryland, Liz Sonenberg
of the University of Melbourne, Peter Struss of OCC’M Software and TU Munich,
and Alex Zelinsky of the Australian National University. 

We extend our thanks to the members of the program committee who processed a large 
review workload under tight time constraints. We especially thank our host, Professor
Robin King, Pro Vice Chancellor of the Division of Information Technology, Engineering,
and the Environment at UniSA, for providing infrastructure and financial support.
We are also grateful to the US Air Force Office of Scientific Research, Asian Office of 
Aerospace Research and Development, and the Commonwealth Defence Science and 
Technology Organisation for their financial support. Finally, we would like to thank all
those who contributed to the conference organization; without their help the conference
could not have taken place.
A Memetic Pareto Evolutionary Approach to
Artificial Neural Networks 



H.A. Abbass 

University of New South Wales, School of Computer Science, ADFA Campus, 
Canberra, ACT 2600, Australia, h.abbass@adfa.edu.au. 



Abstract. Evolutionary Artificial Neural Networks (EANN) have been
a focus of research in the areas of Evolutionary Algorithms (EA) and
Artificial Neural Networks (ANN) for the last decade. In this paper, we
present an EANN approach based on Pareto multi-objective optimization
and differential evolution augmented with local search. We call the
approach Memetic Pareto Artificial Neural Networks (MPANN). We show
empirically that MPANN is capable of overcoming the slow training of
traditional EANN with equivalent or better generalization.

Keywords: neural networks, genetic algorithms 



1 Introduction 

Evolutionary Artificial Neural Networks (EANNs) have been a key research area
for the last decade. On the one hand, methods and techniques have been
developed to find better approaches for evolving Artificial Neural Networks
and, more precisely for the purposes of this paper, multi-layer feed-forward
Artificial Neural Networks (ANNs). On the other hand, finding a good ANN
architecture has been an issue as well in the field of ANNs. Methods for
network growing (such as Cascade Correlation [4]) and for network pruning
(such as Optimal Brain Damage [14]) have been used to overcome the long
process of determining a good network architecture. However, all these
methods still suffer from slow convergence and long training times. In
addition, they are based on gradient-based techniques and can therefore
easily get stuck in a local minimum. EANNs provide a better platform for
optimizing both the network performance and architecture simultaneously.
Unfortunately, all of the research undertaken in the EANN literature ignores
the fact that there is always a trade-off between the architecture and the
generalization ability of the network. A network with more hidden units may
perform better on the training set, but may not generalize well on the test
set. This trade-off is a well-known problem in optimization known as the
Multi-objective Optimization Problem (MOP).

With the trade-off between the network architecture (taken in this paper
to be the number of hidden units) and the generalization error, the EANN
problem is in effect a MOP. It is, therefore, natural to ask why a
multi-objective approach should not be applied to EANN.
The objective of this paper is to present Memetic (i.e. evolutionary
algorithms (EAs) augmented with local search [18]) Pareto Artificial Neural
Networks (MPANN). The rest of the paper is organized as follows: In Section 2,
background materials are covered, followed by an explanation of the methods
in Section 3. Results are discussed in Section 4 and conclusions are drawn in
Section 5.

2 Background Materials 

In this section, we introduce the necessary background materials for
Multi-objective Optimization, ANNs, Differential Evolution (DE), Evolutionary
Multi-objective, and EANN.

2.1 Multi-objective Optimization 

Consider a Multi-Objective Optimization Problem (MOP) model as presented
below:

    Optimize $F(x)$
    subject to: $\Omega = \{x \in \mathbb{R}^n \mid G(x) \le 0\}$

where $x$ is a vector of decision variables $(x_1, \ldots, x_n)$ and $F(x)$ is a vector
of objective functions $(f_1(x), \ldots, f_K(x))$. Here $f_1(x), \ldots, f_K(x)$ are functions
on $\mathbb{R}^n$ and $\Omega$ is a nonempty set in $\mathbb{R}^n$. The vector $G(x)$ represents a set of
constraints.

In MOPs, the aim is to find the optimal solution $x^* \in \Omega$ which optimizes $F(x)$.
Each objective function, $f_i(x)$, is either maximized or minimized. Without
loss of generality, we assume for clarity that all objectives are to be minimized.
We may note that any maximization objective can be transformed into
a minimization one by multiplying the former by -1.

To define the concept of non-dominated solutions in MOPs, we need to define
two operators, $\ne$ and $\preceq$, and then assume two vectors, $x$ and $y$. We define the
first operator as $x \ne y$ iff $\exists\, x_i \in x$ and $y_i \in y$ such that $x_i \ne y_i$. And,
$x \preceq y$ iff $\forall\, x_i \in x$ and $y_i \in y$, $x_i \le y_i$, and $x \ne y$. The operators $\ne$
and $\preceq$ can be seen as the “not equal to” and “less than or equal to” operators
respectively, between two vectors. We can now define the concepts of local and
global optimality in MOPs.

Definition 1: Neighborhood or open ball The open ball (i.e. a neighborhood
centered on $x^*$ and defined by the Euclidean distance) is
$B_\delta(x^*) = \{x \in \mathbb{R}^n \mid \|x - x^*\| < \delta\}$.

Definition 2: Local efficient (non-inferior/Pareto-optimal) solution A
vector $x^* \in \Omega$ is said to be a local efficient solution of MOP iff
$\nexists\, x \in (B_\delta(x^*) \cap \Omega)$ such that $F(x) \preceq F(x^*)$ for some positive $\delta$.

Definition 3: Global efficient (non-inferior/Pareto-optimal) solution
A vector $x^* \in \Omega$ is said to be a global efficient solution of MOP iff
$\nexists\, x \in \Omega$ such that $F(x) \preceq F(x^*)$.
Definition 4: Local non-dominated solution A vector $y^* \in F(x)$ is said to
be a local non-dominated solution of MOP iff its projection onto the decision
space, $x^*$, is a local efficient solution of MOP.

Definition 5: Global non-dominated solution A vector $y^* \in F(x)$ is said to
be a global non-dominated solution of MOP iff its projection onto the decision
space, $x^*$, is a global efficient solution of MOP.

In this paper, the term “non-dominated solution” is used as shorthand for
the term “local non-dominated solution”.
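For a finite set of candidate solutions, the dominance relation above reduces to a pairwise comparison of objective vectors. The following Python sketch illustrates this under the minimization assumption. The function names `dominates` and `non_dominated` and the sample objective values are illustrative only and are not taken from the paper; the check is made against the other members of the list rather than against an open ball in the decision space.

```python
from typing import List, Sequence


def dominates(fa: Sequence[float], fb: Sequence[float]) -> bool:
    """True if fa is no worse than fb in every objective and the two
    vectors differ in at least one objective (all objectives minimized)."""
    no_worse = all(a <= b for a, b in zip(fa, fb))
    different = any(a != b for a, b in zip(fa, fb))
    return no_worse and different


def non_dominated(points: List[Sequence[float]]) -> List[int]:
    """Indices of the points that no other point in the list dominates."""
    return [
        i
        for i, p in enumerate(points)
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i)
    ]


# Hypothetical (error, number of hidden units) pairs for five networks.
objs = [(0.20, 3), (0.15, 5), (0.25, 4), (0.15, 7), (0.30, 2)]
front = non_dominated(objs)
print(front)  # -> [0, 1, 4]
```

Networks 2 and 3 are dropped because network 0 and network 1, respectively, are at least as good in both objectives and strictly better in one.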

2.2 Artificial Neural Networks 

We may define an ANN by a graph: $G(N, A, \psi)$, where $N$ is a set of neurons (also
called nodes), $A$ denotes the connections (also called arcs or synapses) between
the neurons, and $\psi$ represents the learning rule whereby neurons are able to
adjust the strengths of their interconnections. A neuron receives its inputs (also
called activation) from an external source or from other neurons in the network.
It then undertakes some processing on this input and sends the result as an
output. The underlying function of a neuron is called the activation function.
The activation, $a$, is calculated as a weighted sum of the inputs to the node in
addition to a constant value called the bias. The bias can easily be added
to the input set and treated as a constant input. Henceforth, the following
notation will be used for a single hidden layer MLP:

— $I$ and $H$ are the number of input and hidden units respectively.

— $X^p \in X = (x_1^p, x_2^p, \ldots, x_I^p)$, $p = 1, \ldots, P$, is the $p$-th pattern in the input
feature space $X$ of dimension $I$, and $P$ is the total number of patterns.

— Without any loss of generality, $Y_o^p \in Y_o$ is the corresponding scalar of
pattern $X^p$ in the hypothesis space $Y_o$.

— $w_{ih}$ and $w_{ho}$ are the weights connecting input unit $i$, $i = 1 \ldots I$, to hidden
unit $h$, $h = 1 \ldots H$, and hidden unit $h$ to the output unit $o$ (where $o$ is
assumed to be 1 in this paper) respectively.

— $\Theta_h(X^p) = \sigma(a_h)$; $a_h = \sum_{i=0}^{I} w_{ih} x_i^p$, $h = 1 \ldots H$, is the hidden unit’s
output corresponding to the input pattern $X^p$, where $a_h$ is the activation of
hidden unit $h$, and $\sigma(\cdot)$ is the activation function that is taken in this paper
to be the logistic function $\sigma(z) = \frac{1}{1 + e^{-Dz}}$, with $D$ the function’s sharpness
or steepness, taken to be 1 unless mentioned otherwise.

— $\hat{Y}_o^p = \sigma(a_o)$; $a_o = \sum_{h=0}^{H} w_{ho} \Theta_h(X^p)$ is the network output and $a_o$ is the
activation of output unit $o$ corresponding to the input pattern $X^p$.

MLPs are in essence non-parametric regression methods which approximate
the underlying functionality in data by minimizing a risk function. The data are
presented to the network and the risk function is approximated empirically, as $R_{emp}$,
by summing over all data instances as follows:

    $R_{emp}(\alpha) = \frac{1}{P} \sum_{p=1}^{P} (Y_o^p - \hat{Y}_o^p)^2$    (1)
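To make the notation concrete, the forward pass and the empirical risk of Equation (1) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the author's implementation: the bias is stored as index 0 of each weight array (matching sums that start at 0), the risk is normalized by the number of patterns $P$, and the names `logistic`, `forward`, and `empirical_risk` are hypothetical.

```python
import math
from typing import List


def logistic(z: float, d: float = 1.0) -> float:
    """Logistic activation with sharpness d (d = 1 by default)."""
    return 1.0 / (1.0 + math.exp(-d * z))


def forward(x: List[float], w_ih: List[List[float]], w_ho: List[float]) -> float:
    """Output of a single-hidden-layer, single-output MLP.

    w_ih[i][h]: weight from input i to hidden unit h, row 0 is the bias input.
    w_ho[h]:    weight from hidden unit h to the output, entry 0 is the bias.
    """
    xb = [1.0] + list(x)  # constant bias input
    n_hidden = len(w_ih[0])
    hidden = [
        logistic(sum(w_ih[i][h] * xb[i] for i in range(len(xb))))
        for h in range(n_hidden)
    ]
    hb = [1.0] + hidden  # constant bias unit for the output layer
    return logistic(sum(w_ho[h] * hb[h] for h in range(len(hb))))


def empirical_risk(X, Y, w_ih, w_ho) -> float:
    """Mean squared difference between targets and network outputs."""
    return sum((y - forward(x, w_ih, w_ho)) ** 2 for x, y in zip(X, Y)) / len(X)


# Two inputs, two hidden units, all weights zero: every output is 0.5.
X = [[0.0, 0.0], [1.0, 1.0]]
Y = [0.0, 1.0]
w_ih = [[0.0, 0.0] for _ in range(3)]
w_ho = [0.0, 0.0, 0.0]
risk = empirical_risk(X, Y, w_ih, w_ho)
```

With all weights at zero each pattern is mapped to 0.5, so both squared errors are 0.25 and the risk is 0.25.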
The Back-propagation algorithm (BP), developed initially by Werbos [25] and
then independently by Rumelhart's group [21], is commonly used for training
the network. BP uses the gradient of the empirical risk function to alter the
parameter set $\alpha$ until the empirical risk is minimized. BP in its simple form uses
a single parameter, $\eta$, representing the learning rate. For a complete description
of the derivation of this algorithm, see for example [8]. The algorithm can be
described in the following steps:

1. Until termination conditions are satisfied, do

   a) for each input-output pair, $(X^p, Y^p)$, in the training set, apply the
      following steps

      i. Inject the input pattern $X^p$ into the network.

      ii. Calculate the output, $\Theta_h(X^p)$, for each hidden unit $h$.

      iii. Calculate the output, $\hat{Y}_o^p$, for each output unit $o$.

      iv. For the output unit $o$, calculate $r_o = \hat{Y}_o^p (1 - \hat{Y}_o^p)(Y_o^p - \hat{Y}_o^p)$, where $r_o$
          is the rate of change in the error of the output unit $o$.

      v. For each hidden unit $h$, $r_h = \Theta_h^p (1 - \Theta_h^p)\, w_{ho}\, r_o$, where $r_h$ is the rate
         of change in the error of hidden unit $h$.

      vi. Update each weight in the network using the learning rate $\eta$ as
          follows:

          $w_{ih} \leftarrow w_{ih} + \Delta w_{ih}$,  $\Delta w_{ih} = \eta\, r_h\, a_{ih}$    (2)

          $w_{ho} \leftarrow w_{ho} + \Delta w_{ho}$,  $\Delta w_{ho} = \eta\, r_o\, a_{ho}$    (3)
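The steps above can be sketched in Python for a network with one hidden layer and one output. The sketch makes two assumptions that the text does not spell out: the quantity multiplying $r_h$ in Equation (2) is taken to be the input $x_i^p$ feeding the connection, and the one multiplying $r_o$ in Equation (3) is taken to be the hidden output $\Theta_h(X^p)$. The bias is stored at index 0, and the name `bp_epoch`, the OR data, and the learning rate used here are illustrative only.

```python
import math
import random


def sigma(z: float) -> float:
    """Logistic activation with sharpness D = 1."""
    return 1.0 / (1.0 + math.exp(-z))


def bp_epoch(X, Y, w_ih, w_ho, eta=0.1):
    """One pass of pattern-by-pattern back-propagation.

    w_ih[i][h]: input i -> hidden h (row 0 is the bias input).
    w_ho[h]:    hidden h -> output  (entry 0 is the bias unit).
    Weights are updated in place; the mean squared error observed
    during the pass is returned.
    """
    n_hidden = len(w_ho) - 1
    total = 0.0
    for x, y in zip(X, Y):
        xb = [1.0] + list(x)
        # Steps i-iii: forward pass.
        theta = [1.0] + [
            sigma(sum(w_ih[i][h] * xb[i] for i in range(len(xb))))
            for h in range(n_hidden)
        ]
        y_hat = sigma(sum(w_ho[h] * theta[h] for h in range(n_hidden + 1)))
        total += (y - y_hat) ** 2
        # Step iv: error term of the output unit.
        r_o = y_hat * (1.0 - y_hat) * (y - y_hat)
        # Step v: error terms of the hidden units (uses weights before update).
        r_h = [
            theta[h + 1] * (1.0 - theta[h + 1]) * w_ho[h + 1] * r_o
            for h in range(n_hidden)
        ]
        # Step vi: weight updates.
        for h in range(n_hidden + 1):
            w_ho[h] += eta * r_o * theta[h]
        for i in range(len(xb)):
            for h in range(n_hidden):
                w_ih[i][h] += eta * r_h[h] * xb[i]
    return total / len(X)


rng = random.Random(0)
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = [0, 1, 1, 1]  # logical OR, used only as a toy problem
w_ih = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(3)]
w_ho = [rng.uniform(-0.5, 0.5) for _ in range(3)]
errors = [bp_epoch(X, Y, w_ih, w_ho, eta=0.5) for _ in range(500)]
```

The list `errors` records the training error per epoch, which should fall as training proceeds on this toy problem.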



2.3 Differential Evolution 

Evolutionary algorithms [5] are a class of global optimization techniques that use
selection and recombination as their primary operators to tackle optimization
problems. Differential evolution (DE) is a branch of evolutionary algorithms
developed by Rainer Storn and Kenneth Price [24] for optimization problems
over continuous domains. In DE, each variable is represented in the chromosome
by a real number. The approach works as follows:

1. Create an initial population of potential solutions at random, where it is 
guaranteed, by some repair rules, that variables’ values are within their 
boundaries. 

2. Until termination conditions are satisfied 

a) Select at random a trial individual for replacement, an individual as the
main parent, and two individuals as supporting parents.

b) With some probability, called the crossover probability, each variable in 
the main parent is perturbed by adding to it a ratio, F, of the difference 
between the two values of this variable in the other two supporting par- 
ents. At least one variable must be changed. This process represents the 
crossover operator in DE. 

c) If the resultant vector is better than the trial solution, it replaces it; 
otherwise the trial solution is retained in the population. 

d) go to 2 above. 
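The DE loop described above can be written compactly. The Python sketch below follows the four steps as stated, with a few assumptions of its own: clipping to the bounds stands in for the unspecified repair rules, the ratio `F` is fixed at 0.8, the termination condition is a fixed number of iterations, and the sphere function is only a toy objective. None of these values come from the paper.

```python
import random


def differential_evolution(f, bounds, pop_size=20, cr=0.5, F=0.8,
                           iters=2000, seed=0):
    """Minimize f over a box, following the steps listed in the text."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Step 1: random initial population inside the variable bounds.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    cost = [f(p) for p in pop]
    for _ in range(iters):
        # Step 2a: trial individual, main parent, two supporting parents.
        trial, main, s1, s2 = rng.sample(range(pop_size), 4)
        # Step 2b: perturb each variable with probability cr; one position
        # is always selected so that at least one variable is perturbed.
        forced = rng.randrange(dim)
        child = list(pop[main])
        for j in range(dim):
            if rng.random() < cr or j == forced:
                lo, hi = bounds[j]
                value = pop[main][j] + F * (pop[s1][j] - pop[s2][j])
                child[j] = max(lo, min(hi, value))  # repair by clipping
        # Step 2c: the child replaces the trial individual only if better.
        c = f(child)
        if c < cost[trial]:
            pop[trial], cost[trial] = child, c
    best = min(range(pop_size), key=cost.__getitem__)
    return pop[best], cost[best]


def sphere(v):
    """Toy objective: sum of squares, minimized at the origin."""
    return sum(t * t for t in v)


box = [(-5.0, 5.0)] * 3
best_x, best_cost = differential_evolution(sphere, box, seed=1)
```

Because a child only ever replaces a worse trial individual, the best cost in the population never increases from one iteration to the next.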
2.4 Evolutionary Multi-objective

EAs for MOPs [3] can be categorized into one of three categories: plain
aggregating, population-based non-Pareto, and Pareto-based approaches. The plain
aggregating approach combines all the objectives into one using a linear combination
(such as in the weighted sum method, goal programming, and goal attainment).
Therefore, each run results in a single solution and many runs are needed to
generate the Pareto frontier. In addition, the importance of each objective must
be quantified (e.g. by setting numerical weights), which is very
difficult in most practical situations. Meanwhile, optimizing all the objectives
simultaneously and generating a set of alternative solutions, as in
population-based approaches, offers more flexibility.

There have been a number of methods in the literature for population-based
non-Pareto [23] and Pareto [9,32,13] approaches to MOPs. More recently, we
developed the Pareto Differential Evolution (PDE) method using Differential
Evolution (DE) for MOPs [1]. The PDE method outperformed all previous methods
on five benchmark problems.



2.5 Evolutionary Artificial Neural Networks 

Over the last two decades, research into EANN has flourished
[28,27]. Yao [29] presents a thorough review of the field with over 300 references
in the area of EANN alone. This may indicate that there is an extensive need for
finding better ways to evolve ANNs.

A major advantage of the evolutionary approach over traditional learning
algorithms such as Back-propagation (BP) is the ability to escape local
optima. Further advantages include robustness and the ability to adapt to a changing
environment. In the literature, research into EANN has taken one of three
approaches: evolving the weights of the network, evolving the architecture, or
evolving both simultaneously.

The EANN approach uses either a binary representation [10,11] or a real-valued
representation [6,7,16,19] to evolve the weight matrix. Binary encoding has no
obvious advantage over real encoding in EANN, whereas real encoding offers a
more compact and natural representation.

The key problem (other than being trapped in a local minimum) with BP 
and other traditional training algorithms is the choice of a correct architecture 
(number of hidden nodes and connections). This problem has been tackled by 
the evolutionary approach in many studies [12,15,20,30,31]. In some of these 
studies, weights and architectures were evolved simultaneously. 

The major disadvantage of the EANN approach is that it is computationally
expensive, as the evolutionary approach is normally slow. To overcome the slow
convergence of the evolutionary approach to ANN, hybrid techniques were used
to speed up the convergence by augmenting evolutionary algorithms with a local
search technique (i.e. a memetic approach), such as BP [26].
3 The MPANN Algorithm

3.1 Representation 

In deciding on an appropriate representation, we tried to choose a representation
that can be used for other architectures without further modifications.
Our chromosome is a class that contains one matrix $\Omega$ and one vector $\rho$. The
matrix $\Omega$ is of dimension $(I + O) \times (H + O)$. Each element $\omega_{ij} \in \Omega$ is the
weight connecting unit $i$ with unit $j$, where $i = 0, \ldots, (I - 1)$ is the input unit
$i$; $i = I, \ldots, (I + O - 1)$ is the output unit $(i - I)$; $j = 0, \ldots, (H - 1)$ is the
hidden unit $j$; and $j = H, \ldots, (H + O - 1)$ is the output unit $(j - H)$. This
representation has the following two characteristics that we are not using in the
current version but that can easily be incorporated in the algorithm in future work:

1. It allows direct connections from each input to each output unit (we allow
more than a single output unit in our representation).

2. It allows recurrent connections between the output units and themselves. 

The vector $\rho$ is of dimension $H$, where $\rho_h \in \rho$ is a binary value used to indicate
if hidden unit $h$ exists in the network or not; that is, it works as a switch to turn
a hidden unit on or off. The sum, $\sum_{h} \rho_h$, represents the actual number of
hidden units in a network, where $H$ is the maximum number of hidden units.
This representation allows simultaneous training of the weights in the network
and selecting a subset of hidden units.
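One possible reading of this representation is sketched below in Python. The class name `Chromosome` and its method names are hypothetical. The placement of hidden-to-output weights in the rows $i = I, \ldots, I + O - 1$ of the matrix is an interpretation of the index ranges given above, and the random initialization (Gaussian weights, switches set to 1 with probability 0.5) anticipates step 1 of the MPANN algorithm.

```python
import random
from dataclasses import dataclass, field
from typing import List


@dataclass
class Chromosome:
    n_in: int           # I, number of input units
    n_hidden_max: int   # H, maximum number of hidden units
    n_out: int = 1      # O, number of output units
    omega: List[List[float]] = field(default_factory=list)  # (I+O) x (H+O)
    rho: List[int] = field(default_factory=list)            # H binary switches

    @classmethod
    def random_init(cls, n_in, n_hidden_max, n_out=1, rng=None):
        """Gaussian N(0, 1) weights; each switch is 1 with probability 0.5."""
        rng = rng or random.Random()
        rows, cols = n_in + n_out, n_hidden_max + n_out
        omega = [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]
        rho = [1 if rng.random() < 0.5 else 0 for _ in range(n_hidden_max)]
        return cls(n_in, n_hidden_max, n_out, omega, rho)

    def input_to_hidden(self, i: int, h: int) -> float:
        """Weight between input unit i and hidden unit h."""
        return self.omega[i][h]

    def hidden_to_output(self, h: int, o: int = 0) -> float:
        """Weight between hidden unit h and output unit o (assumed layout)."""
        return self.omega[self.n_in + o][h]

    def active_hidden(self) -> int:
        """Actual number of hidden units, i.e. the sum of the switches."""
        return sum(self.rho)


# Dimensions matching the experiments: 14 inputs, at most 10 hidden units.
c = Chromosome.random_init(14, 10, rng=random.Random(0))
```

The remaining entries of the matrix (input-to-output and output-to-output) correspond to the two unused characteristics listed above.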

3.2 Methods 

As the name of our proposed method indicates, we have a multi-objective problem
with two objectives: one is to minimize the error and the other is to minimize
the number of hidden units. The Pareto frontier of the trade-off between the two
objectives will have a set of networks with different numbers of hidden units
(note the definition of Pareto-optimal solutions). However, sometimes the
algorithm will return two Pareto networks with the same number of hidden units.
This will only take place when the actual number of Pareto-optimal solutions
in the population is less than 3. Because DE requires at least 3 parents in each
generation, if there are fewer than three parents, the Pareto-optimal solutions
are removed from the population and the population is re-evaluated. For
example, assume that we have only 1 Pareto-optimal solution in the population.
In this case, we need another 2. The process simply starts by removing the
Pareto-optimal solution from the population and finding the Pareto-optimal
solutions in the remainder of the population. Those solutions dominating the
rest of the population are added to the Pareto list until the number of
Pareto solutions in the list is 3.
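The padding procedure just described can be illustrated as follows. This Python sketch is one reading of the text, not the author's code: it first labels the true non-dominated set and then, while fewer than three individuals are labelled, adds one non-dominated individual at a time from the unlabelled remainder. The function names and the objective values are hypothetical.

```python
def dominates(fa, fb):
    """fa is no worse than fb everywhere and differs somewhere (minimization)."""
    return all(a <= b for a, b in zip(fa, fb)) and any(
        a != b for a, b in zip(fa, fb)
    )


def select_parents(objs, minimum=3):
    """Indices labelled as non-dominated, padded up to `minimum` entries."""
    unlabelled = list(range(len(objs)))

    def is_free(i):
        # i is not dominated by any other unlabelled individual
        return not any(dominates(objs[j], objs[i]) for j in unlabelled if j != i)

    labelled = [i for i in unlabelled if is_free(i)]
    unlabelled = [i for i in unlabelled if i not in labelled]
    while len(labelled) < minimum and unlabelled:
        nxt = next(i for i in unlabelled if is_free(i))
        labelled.append(nxt)
        unlabelled.remove(nxt)
    return labelled


# (error, hidden units): network 0 dominates every other network.
objs = [(0.10, 2), (0.20, 3), (0.25, 3), (0.30, 5)]
parents = select_parents(objs)
print(parents)  # -> [0, 1, 2]
```

In this example the padded list contains networks 1 and 2, which both have 3 hidden units, which is the situation the text describes when fewer than three true Pareto-optimal solutions exist.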

Our proposed method augments the original PDE [1,22] algorithm with local
search (i.e. BP) to form the memetic approach. In initial investigations, the
algorithm was quite slow and the use of local search improved its performance.
MPANN consists of the following steps:
1. Create a random initial population of potential solutions. The elements of
the weight matrix $\Omega$ are assigned random values according to a Gaussian
distribution $N(0, 1)$. The elements of the binary vector $\rho$ are assigned the value
1 with probability 0.5 based on a randomly generated number according to
a uniform distribution between [0, 1]; otherwise 0.

2. Repeat 

a) Evaluate the individuals in the population and label those who are non- 
dominated. 

b) If the number of non-dominated individuals is less than 3 repeat the 
following until the number of non-dominated individuals is greater than 
or equal to 3:- 

i. Find a non-dominated solution among those who are not labelled. 

ii. Label the solution as non-dominated. 

c) Delete all dominated solutions from the population. 

d) Mark 20% of the training set as a validation set for BP. 

e) Repeat 

i. Select at random an individual as the main parent $\alpha_1$, and two
individuals, $\alpha_2, \alpha_3$, as supporting parents.

ii. With some crossover probability $Uniform(0, 1)$, do

    $\omega_{ih}^{child} \leftarrow \omega_{ih}^{\alpha_1} + Gaussian(0, 1)\,(\omega_{ih}^{\alpha_2} - \omega_{ih}^{\alpha_3})$    (4)

    [...]    (5)-(8)

    otherwise

    $\omega_{ho}^{child} \leftarrow \omega_{ho}^{\alpha_1}$    (9)

where each weight in the main parent is perturbed by adding to it a
ratio, $F \in Gaussian(0, 1)$, of the difference between the two values
of this variable in the two supporting parents. At least one variable
must be changed.

iii. Apply BP to the child. 

iv. If the child dominates the main parent, place it into the population.

f) Until the population size is M 

3. Until termination conditions are satisfied, go to 2 above. 
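The weight perturbation in step 2(e)ii and the acceptance rule in step 2(e)iv can be sketched as follows. This is a simplified illustration, not the full MPANN procedure: weights are held in a flat list, the binary switch vector is left out because its update rule is not reproduced here, BP (step iii) is omitted, and the objective values passed to `accept` are made up for the example. The function names are hypothetical.

```python
import random


def make_child(main, s1, s2, cr, rng):
    """Perturb the main parent's weights with a Gaussian ratio of the
    difference between the two supporting parents.

    Each position is selected with probability cr; one randomly chosen
    position is always selected so that at least one variable is perturbed.
    """
    forced = rng.randrange(len(main))
    child = list(main)
    for k in range(len(main)):
        if rng.random() < cr or k == forced:
            child[k] = main[k] + rng.gauss(0.0, 1.0) * (s1[k] - s2[k])
    return child


def dominates(fa, fb):
    """fa is no worse than fb everywhere and differs somewhere (minimization)."""
    return all(a <= b for a, b in zip(fa, fb)) and any(
        a != b for a, b in zip(fa, fb)
    )


def accept(child_objs, parent_objs):
    """The child enters the population only if it dominates the main parent."""
    return dominates(child_objs, parent_objs)


rng = random.Random(0)
main = [0.0] * 5
s1 = [1.0] * 5
s2 = [0.0] * 5
child_low = make_child(main, s1, s2, cr=0.0, rng=rng)   # only the forced position
child_high = make_child(main, s1, s2, cr=1.0, rng=rng)  # every position
```

With `cr=0.0` exactly one weight differs from the main parent, and with `cr=1.0` all of them do, which mirrors the range of crossover probabilities explored in the experiments.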
One may note that before each generation starts, 20% of the instances in
the training set are marked as a validation set for the use of BP; that is, BP
will use 80% of the original training set for training and 20% for validation.
Also, the termination condition in our experiments is that the maximum number of
epochs is reached, where one epoch is equivalent to one pass through the training
set. Therefore, one iteration of BP is equivalent to one epoch, since 80% of the
training set is used for training and the other 20% for validation; that is, one
complete pass through the original training set. After the network is trained, the
chromosome changes to reflect the new weight sets.



4 Experiments 

4.1 Data Sets 

We have tested MPANN on two benchmark data sets: the Australian credit card
assessment problem and the diabetes problem. Both data sets are available by
anonymous ftp from ics.uci.edu [2]. The following is a brief description of each
data set.

— The Australian Credit Card Assessment Data Set 

This data set contains 690 patterns with 14 attributes; 6 of them are numeric 
and 8 discrete (with 2 to 14 possible values). The predicted class is binary - 
1 for awarding the credit and 0 for not. The problem is to assess applications 
for credit cards [17]. 

— The Diabetes Data Set 

This data set has 768 patterns; 500 belonging to the first class and 268 to 
the second. It contains 8 attributes. The objective is to test whether a patient
has diabetes or not. The classification problem is difficult as the class
value is a binarized form of another attribute that is highly indicative of a 
certain type of diabetes without having a one-to-one correspondence with 
the medical condition of being diabetic [17]. 



4.2 Experimental Setup 

To be consistent with the literature [17], the Australian credit card assessment
data set is divided into 10 folds and the Diabetes data set into 12 folds, where the
class distribution is maintained in each fold. Leave-one-fold-out cross-validation is
used: we run the algorithm with 9 (11) out of the 10 (12) folds for each
data set and then test with the remaining one. We vary the crossover probability
from 0 to 1 in increments of 0.1. The maximum number of epochs is set
to 2000, the population size to 25, the learning rate for BP to 0.003, the maximum
number of hidden units to 10, and the number of epochs for BP to 5.
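The fold construction can be illustrated with a short Python sketch. It shows one simple way to keep the class distribution roughly equal across folds (shuffling within each class and dealing indices round-robin); the paper does not state how its folds were built, so the procedure, the function names, and the seed are assumptions. The class counts are those given above for the Diabetes data set.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Split indices into k folds while spreading each class evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    pos = 0  # running counter so that fold sizes stay balanced across classes
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        for idx in idxs:
            folds[pos % k].append(idx)
            pos += 1
    return folds


def cv_splits(folds):
    """Yield (train, test) index lists, holding out one fold at a time."""
    for t in range(len(folds)):
        train = [i for f, fold in enumerate(folds) if f != t for i in fold]
        yield train, folds[t]


# Diabetes: 500 patterns in the first class, 268 in the second, 12 folds.
labels = [0] * 500 + [1] * 268
folds = stratified_folds(labels, 12)
splits = list(cv_splits(folds))

# The eleven crossover probabilities 0.0, 0.1, ..., 1.0.
crossover_grid = [i / 10 for i in range(11)]
```

Each of the 12 folds holds 64 patterns, so every run trains on 704 patterns and tests on the remaining 64.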
4.3 Results

The average of the Pareto networks with the best generalization and the
corresponding number of hidden units in each fold are calculated along with the
standard deviations, as shown in Table 1. It is interesting to see the small
standard deviations for the test error in both data sets, which indicate the consistency
and stability of the method.



Table 1. The average and standard deviations of the Pareto network with the best
generalization (smallest test error) in each run

Data set                  Error            Number of hidden units
Australian Credit Card    0.136 ± 0.045    5.000 ± 1.943
Diabetes                  0.251 ± 0.062    6.6 ± 1.505



In Figure 1, the average test and training errors corresponding to the best
generalized network in each fold are plotted against each of the eleven crossover
probabilities. In Figure 1 (left), with crossover 0.1 and upward, the test error
is always smaller than the training error, which indicates better generalization.
However, the degree of this generalization varied across the different crossover
probabilities. The best performance occurs with crossover probability 0.3, which
indicates that 30% of the weights, on average, in each parent change. This is
quite important as it entails that the building blocks in MPANN are effective;
otherwise a better performance would have occurred with the maximum crossover
probability. We may note here that crossover in DE is in effect a guided
mutation operator. In Figure 1 (right), it is also apparent that an average crossover
probability of 0.8 resulted in the best generalization ability. Very high or low
crossover probabilities are not as good.

In summary, the best performances for the Australian credit card and Dia- 
betes data sets are 0.136 ± 0.045 and 0.251 ± 0.062 respectively and occur with 
crossover probabilities 0.3 and 0.8 respectively. 



4.4 Comparisons and Discussions 

We compare our results against 23 algorithms tested by Michie et al. [17]. These
algorithms can be categorized into decision trees (CART, IndCART, NewID,
AC2, Baytree, Cal5, and C4.5), rule-based methods (CN2 and ITrule), neural
networks (Backprob, Kohonen, LVQ, RBF, and DIPOL92), and statistical
algorithms (Discrim, Quadisc, Logdisc, SMART, ALLOC80, k-NN, CASTLE,
NaiveBay, and Default). For a complete description of these algorithms, the
reader may refer to [17].

In Tables 2 and 3, we find that MPANN is equivalent to or better than BP and
comparable to the others. However, we notice here that MPANN also optimized
its architecture while optimizing its generalization ability. Therefore, in terms of
the amount of computation, it is far faster than BP, as we simultaneously
optimize the architecture and generalization error. In addition, the total number
of epochs used is small compared to the corresponding number of epochs needed
by BP.

Fig. 1. The average training and test error for the Australian Credit Card (on the left)
and Diabetes data sets (on the right) obtained by each crossover probability.

Table 2. Comparing MPANN against 23 traditional methods in terms of the average
generalization error for the Australian Credit Card data set.

Algorithm  Error Rate   Algorithm  Error Rate   Algorithm  Error Rate   Algorithm  Error Rate
MPANN      0.136        CASTLE     0.148        NaiveBay   0.151        Default    0.440
CART       0.145        IndCART    0.152        NewID      0.181        AC2        0.181
Baytree    0.171        Cal5       0.131        C4.5       0.155        CN2        0.204
ITrule     0.137        Backprob   0.154        Kohonen    Fail         LVQ        0.197
RBF        0.145        DIPOL92    0.141        Discrim    0.141        Quadisc    0.207
Logdisc    0.141        SMART      0.158        ALLOC80    0.201        k-NN       0.181



5 Conclusion 



In this paper, we presented a new evolutionary multi-objective approach to 
artificial neural networks. We showed empirically that the proposed approach 
outperformed traditional Back-propagation and had comparable results to 23 
classification algorithms. For future work, we will evaluate the performance of 
the proposed method on regression problems and test the scalability of the evo- 
lutionary approach. 



Table 3. Comparing MPANN against 23 traditional methods in terms of the average
generalization error for the Diabetes data set.

Algorithm  Error Rate   Algorithm  Error Rate   Algorithm  Error Rate   Algorithm  Error Rate
MPANN      0.251        CASTLE     0.258        NaiveBay   0.262        Default    0.350
CART       0.255        IndCART    0.271        NewID      0.289        AC2        0.276
Baytree    0.271        Cal5       0.250        C4.5       0.270        CN2        0.289
ITrule     0.245        Backprob   0.248        Kohonen    0.273        LVQ        0.272
RBF        0.243        DIPOL92    0.224        Discrim    0.225        Quadisc    0.262
Logdisc    0.223        SMART      0.232        ALLOC80    0.301        k-NN       0.324

Acknowledgement. The author would like to thank Xin Yao, Bob McKay, and
Ruhul Sarker for their insightful comments while discussing an initial idea with
them. This work is supported with ADFA Special Research Grants TERM6 2001
DOD02 ZOOM Z2844.
Abstract. Defeasible reasoning is a simple but efficient approach to nonmonotonic
reasoning that has recently attracted considerable interest and that has found
various applications. Defeasible logic and its variants are an important family of 
defeasible reasoning methods. So far no relationship has been established be- 
tween defeasible logic and mainstream nonmonotonic reasoning approaches. 

In this paper we will compare an ambiguity propagating defeasible logic with 
default logic. In fact the two logics take rather contrary approaches: defeasible 
logic takes a directly deductive approach, whereas default logic is based on alter- 
native possible world views, called extensions. Computational complexity results 
suggest that default logics are more expressive than defeasible logics. This paper 
answers the opposite direction: an ambiguity propagating defeasible logic can be 
directly embedded into default logic. 



1 Introduction 

Defeasible reasoning is a nonmonotonic reasoning [11] approach in which the gaps due 
to incomplete information are closed through the use of defeasible rules that are usu- 
ally appropriate. Defeasible logics were introduced and developed by Nute over several 
years [13]. These logics perform defeasible reasoning, where a conclusion supported by 
a rule might be overturned by the effect of another rule. Roughly, a proposition p can 
be defeasibly proved only when a rule supports it, and it has been demonstrated that 
no rule supports ¬p. These logics also have a monotonic reasoning component, and a
priority on rules. One advantage of Nute’s design was that it was aimed at supporting 
efficient reasoning, and in our work we follow that philosophy. 

This family of approaches has recently attracted considerable interest. Apart from 
implementability, its use in various application domains has been advocated, including 
the modelling of regulations and business rules [12, 8, 2], modelling of contracts [15], 
legal reasoning [14] and electronic commerce [7]. 

An interesting question is the relationship to more mainstream nonmonotonic
approaches such as default logic [16]. [10] shows that defeasible logic has linear
complexity. In contrast, the complexity of default logic is known to be high even in
simple cases [9, 6]. Therefore we cannot expect default logic to be naturally embedded
in defeasible logics (under the natural representation of normal defaults as defeasible
rules).

The opposite question, that is whether defeasible logics can be embedded into de- 
fault logics, is answered in this paper. This result cannot be expected for the “standard” 
defeasible logic [4]. It is easily seen that defeasible logic is ambiguity blocking [17],
while default logic propagates ambiguity. 

Recently a family of defeasible logics [3] was introduced, among them an
ambiguity propagating defeasible logic. In this paper we show that this logic can be fully
embedded in default logic under the natural representation of defeasible rules as
normal default rules, and strict rules as defaults without justifications. If D is a defeasible
theory and T(D) is its translation into a default theory, this paper shows that if a literal
is defeasibly provable in D, then it is sceptically provable in T(D) (that is, it is included
in all extensions of T(D)).

The establishment of relationships between different approaches is important: each 
approach may benefit from work done on the other; the combination of strengths can 
lead to new, better approaches; and the assimilation of knowledge is supported. Based 
on the results of this paper, defeasible logic can be viewed as an efficient approximation 
of default logic for certain classes of default theories. 



2 Defeasible Logic 

2.1 A Language for Defeasible Reasoning 

A defeasible theory (a knowledge base in defeasible logic) consists of three different 
kinds of knowledge: strict rules, defeasible rules, and a superiority relation. (Fuller 
versions of defeasible logic also have facts and defeaters, but [4] shows that they can be 
simulated by the other ingredients). 

Strict rules are rules in the classical sense: whenever the premises are indisputable (e.g. 
facts) then so is the conclusion. An example of a strict rule is “Emus are birds”. Written 
formally: 

emu(X) -> bird(X).

Defeasible rules are rules that can be defeated by contrary evidence. An example of 
such a rule is “Birds typically fly”; written formally: 

bird(X) => flies(X).

The idea is that if we know that something is a bird, then we may conclude that it flies, 
unless there is other, not inferior, evidence suggesting that it may not fly. 

The superiority relation among rules is used to define priorities among rules, that is, 
where one rule may override the conclusion of another rule. For example, given the 
defeasible rules 

r : bird(X) => flies(X)

r' : brokenWing(X) => ¬flies(X)

which contradict one another, no conclusive decision can be made about whether a bird
with broken wings can fly. But if we introduce a superiority relation > with r' > r, with
the intended meaning that r' is strictly stronger than r, then we can indeed conclude that
the bird cannot fly.

It is worth noting that, in defeasible logic, priorities are local in the following sense:
two rules are considered to be competing with one another only if they have complementary
heads. Thus, since the superiority relation is used to resolve conflicts among
competing rules, it is only used to compare rules with complementary heads; the information
r > r' for rules r, r' without complementary heads may be part of the superiority
relation, but has no effect on the proof theory.



2.2 Formal Definition 

In this paper we restrict attention to essentially propositional defeasible logic. Rules
with free variables are interpreted as rule schemas, that is, as the set of all ground
instances; in such cases we assume that the Herbrand universe is finite. We assume that
the reader is familiar with the notation and basic notions of propositional logic. If q is a
literal, ~q denotes the complementary literal (if q is a positive literal p then ~q is ¬p;
and if q is ¬p, then ~q is p).

Rules are defined over a language (or signature) U, the set of propositions (atoms) 
and labels that may be used in the rule. 

A rule r : A(r) ↪ C(r) consists of its unique label r, its antecedent A(r) (A(r)
may be omitted if it is the empty set) which is a finite set of literals, an arrow ↪ (which
is a placeholder for concrete arrows to be introduced in a moment), and its head (or
consequent) C(r) which is a literal. In writing rules we omit set notation for antecedents
and sometimes we omit the label when it is not relevant for the context. There are two
kinds of rules, each represented by a different arrow. Strict rules use -> and defeasible
rules use =>.

Given a set R of rules, we denote the set of all strict rules in R by R_s and the set of
defeasible rules in R by R_d. R[q] denotes the set of rules in R with consequent q.

A superiority relation on R is a relation > on R. When r1 > r2, then r1 is called
superior to r2, and r2 inferior to r1. Intuitively, r1 > r2 expresses that r1 overrules r2,
should both rules be applicable. > must be acyclic (that is, its transitive closure must
be irreflexive).

A defeasible theory is a pair (R, >) where R is a finite set of rules, and > is a
superiority relation on R.
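
The definition above maps onto a small data structure. The following Python sketch is an illustration only and not part of the original formalism: the names `Rule`, `DefeasibleTheory`, `with_head` and `is_acyclic`, and the encoding of ¬p as the string "~p", are assumptions made here. It stores R and >, returns R[q] and R_s[q], and checks the acyclicity requirement on > by depth-first search.

```python
from dataclasses import dataclass, field


def neg(lit):
    """Complement of a literal; the string "~p" encodes the negation of atom p."""
    return lit[1:] if lit.startswith("~") else "~" + lit


@dataclass(frozen=True)
class Rule:
    label: str
    antecedent: frozenset   # finite set of literals A(r)
    head: str               # consequent C(r)
    strict: bool = False    # True for a strict rule (->), False for a defeasible rule (=>)


@dataclass
class DefeasibleTheory:
    rules: dict                                  # label -> Rule
    superior: set = field(default_factory=set)   # pairs (r1, r2) meaning r1 > r2

    def with_head(self, q, strict_only=False):
        """R[q], or R_s[q] when strict_only is set."""
        return [r for r in self.rules.values()
                if r.head == q and (r.strict or not strict_only)]

    def is_acyclic(self):
        """True iff the transitive closure of > is irreflexive."""
        succ = {}
        for hi, lo in self.superior:
            succ.setdefault(hi, set()).add(lo)
        state = {}  # label -> "open" while on the DFS stack, "done" afterwards

        def visit(n):
            if state.get(n) == "open":   # back edge: a cycle was found
                return False
            if state.get(n) == "done":
                return True
            state[n] = "open"
            ok = all(visit(m) for m in succ.get(n, ()))
            state[n] = "done"
            return ok

        return all(visit(n) for n in list(succ))


# The bird example, grounded for a single individual.
birds = DefeasibleTheory(
    rules={
        "r": Rule("r", frozenset({"bird"}), "flies"),
        "r'": Rule("r'", frozenset({"brokenWing"}), "~flies"),
    },
    superior={("r'", "r")},
)
print(birds.is_acyclic())
```

Keeping > as a plain set of label pairs mirrors the definition: nothing stops a pair from relating rules without complementary heads, which is harmless because such pairs never influence the proof theory.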



2.3 An Ambiguity Propagating Defeasible Logic 

Here we discuss a defeasible logic that was first introduced in [3]. It is a logic that
propagates ambiguity. A preference for ambiguity blocking or ambiguity propagating
behaviour is one of the properties of non-monotonic inheritance nets over which intuitions
can clash [17]. Ambiguity propagation results in fewer conclusions being drawn,
which might make it preferable when the cost of an incorrect conclusion is high.

A conclusion of a defeasible theory D is a tagged literal. A conclusion has one of 
the following six forms: 

- +Δq, which is intended to mean that the literal q is definitely provable, using only
strict rules.

- −Δq, which is intended to mean that q is provably not strictly provable (finite
failure).

- +∂q, which is intended to mean that q is defeasibly provable in D.

- −∂q, which is intended to mean that we have proved that q is not defeasibly provable
in D.

- +∫q, which is supposed to mean that q is supported (what this means will be
explained soon).

- −∫q, which is supposed to mean that q is provably not supported.

Provability is defined below. It is based on the concept of a derivation (or proof)
in D = (R, >). A derivation is a finite sequence P = (P(1), ..., P(n)) of tagged literals
satisfying the following conditions. The conditions are essentially inference rules
phrased as conditions on proofs. P(1..i) denotes the initial part of the sequence P of
length i.

+Δ: If P(i + 1) = +Δq then

    ∃r ∈ R_s[q] ∀a ∈ A(r) : +Δa ∈ P(1..i)

That means, to prove +Δq we need to establish a proof for q using strict rules
only. This is a deduction in the classical sense: no proofs for the negation of q need
to be considered (in contrast to defeasible provability below, where opposing chains of
reasoning must be taken into account, too).

−Δ: If P(i + 1) = −Δq then

    ∀r ∈ R_s[q] ∃a ∈ A(r) : −Δa ∈ P(1..i)

The definition of −Δ is the so-called strong negation of +Δ. −Δq (−∂q) means
that we have a proof that +Δq (+∂q) cannot be proved.

+∂: If P(i + 1) = +∂q then either

(1) +Δq ∈ P(1..i) or

(2) (2.1) ∃r ∈ R[q] ∀a ∈ A(r) : +∂a ∈ P(1..i) and

    (2.2) −Δ~q ∈ P(1..i) and

    (2.3) ∀s ∈ R[~q] either

        (2.3.1) ∃a ∈ A(s) : −∫a ∈ P(1..i) or

        (2.3.2) ∃t ∈ R[q] such that

            ∀a ∈ A(t) : +∂a ∈ P(1..i) and t > s

−∂: If P(i + 1) = −∂q then

(1) −Δq ∈ P(1..i) and

(2) (2.1) ∀r ∈ R[q] ∃a ∈ A(r) : −∂a ∈ P(1..i) or

    (2.2) +Δ~q ∈ P(1..i) or

    (2.3) ∃s ∈ R[~q] such that

        (2.3.1) ∀a ∈ A(s) : +∫a ∈ P(1..i) and

        (2.3.2) ∀t ∈ R[q] either

            ∃a ∈ A(t) : −∂a ∈ P(1..i) or
            not t > s

Let us explain this definition. To show that q is provable defeasibly we have two
choices: (1) We show that q is already definitely provable; or (2) we need to argue using
the defeasible part of D as well. In particular, we require that there must be a strict or
defeasible rule with head q which can be applied (2.1). But now we need to consider
possible “counterattacks”, that is, reasoning chains in support of ~q. To be more specific:
to prove q defeasibly we must show that ~q is not definitely provable (2.2). Also
(2.3) we must consider the set of all rules which are not known to be inapplicable and
which have head ~q. Essentially each such rule s attacks the conclusion q. For q to
be provable, each such rule s must be counterattacked by a rule t with head q with the
following properties: (i) t must be applicable at this point, and (ii) t must be stronger
than (i.e. superior to) s. Thus each attack on the conclusion q must be counterattacked
by a stronger rule.

The only issue we did not discuss was when the attacking rules s should be disregarded
because they are inapplicable. One way is to ignore a rule s if one of its
antecedents is not defeasibly provable. However, this approach leads to the blocking
of ambiguity, as shown in [3]. To propagate ambiguity we make attacks on potential
conclusions easier, or stated another way, we make it more difficult for attacking rules
s to be ignored. This will only happen if at least one of the antecedents is not even
supported.

Next we define the inference conditions for support. 

+∫: If P(i + 1) = +∫q then either
    +Δq ∈ P(1..i) or

    ∃r ∈ R[q] such that

        ∀a ∈ A(r) : +∫a ∈ P(1..i), and
        ∀s ∈ R[~q] either

            ∃a ∈ A(s) : −∂a ∈ P(1..i) or
            not s > r

−∫: If P(i + 1) = −∫q then
    −Δq ∈ P(1..i) and
    ∀r ∈ R[q] either

        ∃a ∈ A(r) : −∫a ∈ P(1..i), or
        ∃s ∈ R[~q] such that

            ∀a ∈ A(s) : +∂a ∈ P(1..i) and
            s > r

The elements of a derivation are called lines of the derivation. We say that a tagged
literal L is provable in D = (R, >), denoted by D ⊢ L, iff there is a derivation P in D
such that L is a line of P.

Example 1. Consider the defeasible theory 

=> p

=> ¬p

=> q

p => ¬q

Neither p nor ¬p is defeasibly provable; however, they are both supported. In an ambiguity
blocking defeasible logic the last rule would be disregarded because we can prove
−∂p. However, in the definition we just gave, although the last rule is not applicable,
its prerequisite is supported, thus the rule has to be counterattacked if we wish to derive
+∂q. However, the superiority relation is empty, so no counterattack is possible and we
can derive −∂q. In this example no positive defeasible conclusions can be drawn.
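
The six inference conditions can be evaluated mechanically for a finite propositional theory. Each condition only asks for membership of tagged literals in the proof built so far, so repeatedly adding every tagged literal whose condition holds reaches a fixpoint containing exactly the provable tagged literals. The sketch below is a hedged illustration of that idea applied to Example 1. The function `conclusions`, the ASCII tag names ("+partial" for +∂, "+support" for +∫, and so on) and the "~p" encoding of ¬p are assumptions of this sketch, not notation from the paper.

```python
from itertools import product

TAGS = ("+Delta", "-Delta", "+partial", "-partial", "+support", "-support")


def neg(lit):
    """Complement of a literal; "~p" encodes the negation of p."""
    return lit[1:] if lit.startswith("~") else "~" + lit


def conclusions(rules, superior=frozenset()):
    """rules: label -> (is_strict, antecedent tuple, head).
    superior: set of (stronger, weaker) label pairs.
    Returns the set of provable (tag, literal) pairs."""
    lits = set()
    for _, body, head in rules.values():
        for lit in (head, *body):
            lits |= {lit, neg(lit)}
    proved = set()

    def has(tag, lit):
        return (tag, lit) in proved

    def heads(q, strict_only=False):   # R[q] or R_s[q], as labels
        return [l for l, (strict, _, h) in rules.items()
                if h == q and (strict or not strict_only)]

    def every(tag, label):             # all antecedents carry the tag
        return all(has(tag, a) for a in rules[label][1])

    def some(tag, label):              # at least one antecedent carries the tag
        return any(has(tag, a) for a in rules[label][1])

    def holds(tag, q):
        if tag == "+Delta":
            return any(every("+Delta", r) for r in heads(q, True))
        if tag == "-Delta":
            return all(some("-Delta", r) for r in heads(q, True))
        if tag == "+partial":
            return has("+Delta", q) or (
                any(every("+partial", r) for r in heads(q))
                and has("-Delta", neg(q))
                and all(some("-support", s)
                        or any(every("+partial", t) and (t, s) in superior
                               for t in heads(q))
                        for s in heads(neg(q))))
        if tag == "-partial":
            return has("-Delta", q) and (
                all(some("-partial", r) for r in heads(q))
                or has("+Delta", neg(q))
                or any(every("+support", s)
                       and all(some("-partial", t) or (t, s) not in superior
                               for t in heads(q))
                       for s in heads(neg(q))))
        if tag == "+support":
            return has("+Delta", q) or any(
                every("+support", r)
                and all(some("-partial", s) or (s, r) not in superior
                        for s in heads(neg(q)))
                for r in heads(q))
        # tag == "-support"
        return has("-Delta", q) and all(
            some("-support", r)
            or any(every("+partial", s) and (s, r) in superior
                   for s in heads(neg(q)))
            for r in heads(q))

    changed = True
    while changed:                     # iterate until no new tagged literal is added
        changed = False
        for tag, q in product(TAGS, sorted(lits)):
            if (tag, q) not in proved and holds(tag, q):
                proved.add((tag, q))
                changed = True
    return proved


# Example 1: four defeasible rules and an empty superiority relation.
example1 = {
    "r1": (False, (), "p"),
    "r2": (False, (), "~p"),
    "r3": (False, (), "q"),
    "r4": (False, ("p",), "~q"),
}
result = conclusions(example1)
positive = sorted(lit for tag, lit in result if tag == "+partial")
print(positive)
```

On this input the list of positively defeasibly provable literals is empty, while −∂q and the support of both p and ¬p are derived, in line with the discussion of Example 1.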



3 Default Logic with Priorities 

A default δ has the form φ : ψ1, ..., ψn / χ with closed formulae φ, ψ1, ..., ψn, χ. φ is the
prerequisite pre(δ), ψ1, ..., ψn the justifications just(δ), and χ the consequent cons(δ)
of δ. A default is called normal if just(δ) = {cons(δ)}.

A default theory T is a pair (W, Def) consisting of a set of formulae W (the set of 
facts) and a countable set Def of defaults. 

Let δ = φ : ψ1, ..., ψn / χ be a default, and E a deductively closed set of formulae. We say
that δ is applicable to E iff φ ∈ E, and ¬ψ1, ..., ¬ψn ∉ E.

Let Π = (δ0, δ1, δ2, ...) be a finite or infinite sequence of defaults from Def
without multiple occurrences (modelling an application order of defaults from Def).
We denote by Π[k] the initial segment of Π of length k, provided the length of Π is at
least k.

- In(Π) = Th(W ∪ {cons(δ) | δ occurs in Π}), where Th denotes the deductive
closure.

- Out(Π) = {¬ψ | ψ ∈ just(δ), δ occurs in Π}.

Π is called a process of T iff δk is applicable to In(Π[k]), for every k such that δk
occurs in Π. Π is successful iff In(Π) ∩ Out(Π) = ∅, otherwise it is failed. Π is
closed iff every default that is applicable to In(Π) already occurs in Π. For normal
default theories all processes are successful.

[1] shows that Reiter’s original definition [16] of extensions is equivalent to the
following one: A set of formulae E is an extension of a default theory T iff there is a
closed and successful process Π of T such that E = In(Π).
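
The process-based characterisation suggests a direct, if naive, way to enumerate extensions of small theories: grow processes one applicable default at a time and record In(Π) whenever a process is closed and successful. The sketch below is written under strong simplifying assumptions that are not in the text: all facts, justifications and consequents are literals, prerequisites are conjunctions of literals, and Th is approximated by the literal set itself (with an inconsistent set entailing every literal). The names `extensions`, `applicable` and `entails` are hypothetical helpers. The input consists of the four normal defaults true : p / p, true : ¬p / ¬p, true : q / q and p : ¬q / ¬q.

```python
def neg(lit):
    """Complement of a literal; "~p" encodes the negation of p."""
    return lit[1:] if lit.startswith("~") else "~" + lit


def entails(E, lit):
    """Membership in Th(E), restricted to literals (ex falso for inconsistent E)."""
    inconsistent = any(neg(x) in E for x in E)
    return inconsistent or lit in E


def applicable(default, E):
    """default = (prerequisite literals, justification literals, consequent)."""
    pre, justs, _ = default
    return (all(entails(E, a) for a in pre)
            and not any(entails(E, neg(j)) for j in justs))


def extensions(W, defaults):
    """Set of In(process) over all closed and successful processes."""
    found = set()

    def grow(used, E, out):
        if any(entails(E, o) for o in out):
            return                      # failed; In and Out only grow, so prune
        candidates = [i for i, d in enumerate(defaults)
                      if i not in used and applicable(d, E)]
        if not candidates:
            found.add(E)                # closed and successful
            return
        for i in candidates:
            _, justs, cons = defaults[i]
            grow(used | {i}, E | {cons}, out | {neg(j) for j in justs})

    grow(frozenset(), frozenset(W), frozenset())
    return found


defaults = [
    ((), ("p",), "p"),
    ((), ("~p",), "~p"),
    ((), ("q",), "q"),
    (("p",), ("~q",), "~q"),
]
exts = extensions(set(), defaults)
for e in sorted(sorted(x) for x in exts):
    print(e)
```

The search is exponential in the number of defaults and is meant only for checking small examples by hand; it makes no attempt at the general first-order case.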

Now we consider the addition of priorities to default logic. We will concentrate on 
static priorities, and will adopt a presentation similar to that of [5]. 

A prioritized default theory is a triple T = (W, Def, >) where W is a set of facts,
Def a countable set of defaults, and > an acyclic relation on Def.

Consider a total order ≫ on Def that expands > (in the sense that it contains more
pairs). We define a process Π≫ = (δ0, δ1, ...) as follows: δi is the largest default
in Def − Π≫[i] that is applicable to In(Π≫[i]) (slightly abusing notation we have used
processes here as sets). Note that, by definition, Π≫ is a closed process because the
selection of the next default is fair (see [1]).

A set of formulas E is called an extension of T iff E = In(Π≫), where ≫ is a
total order on Def that extends >, and for which the process Π≫ is successful.

This definition extends definitions such as [5] which usually assume that all defaults
are normal. The definition can be viewed as a two-step construction:

- Compute all extensions, disregarding priorities.

- Filter out those extensions that can only be obtained using a total order that violates >.



4 Connections 



First we define the natural translation of a defeasible theory D = (R, >) into default
logic. We define a prioritized default theory T(D) = (∅, ⋃_{r ∈ R} def(r), >d) as follows.

A defeasible rule r

{p1, ..., pn} => p

is translated into the following default:

defd(r) = p1 ∧ ... ∧ pn : p / p

def(r) = {defd(r)}. This is the natural translation of a defeasible rule into a normal
default.

It would appear natural to represent a strict rule r

{p1, ..., pn} -> p

as the default

p1 ∧ ... ∧ pn : / p

However, this translation does not work, as the following example demonstrates.



Example 2. Consider the defeasible theory consisting of the rules

=> p
p -> q
=> ¬q

Here p is defeasibly provable, but neither q nor ¬q. There is one rule to support each
conclusion, but there are no priorities to resolve the conflict (and strict rules are not
deemed to be superior to defeasible rules). However, in the translation into default logic

true : p / p        p : / q        true : ¬q / ¬q

there is only one extension, Th({p, q}), so q is sceptically provable.

A close analysis of the inference conditions in defeasible logic reveals that strict
rules play a dual role: on one hand they can be combined with other strict rules to prove
literals strictly. On the other hand they may be combined with other strict rules and at
least one defeasible rule to prove literals defeasibly, but then strict rules are treated
exactly like defeasible rules.

This point is analysed in [4]. There it is shown that every defeasible theory can be
equivalently transformed into one where the two roles are separated. We could then
apply the natural translation into default logic, as outlined above, and get the desired
result. Instead, in this paper we take a slightly different path: We maintain the generality
of defeasible theories, but make the translation slightly more complicated.

A strict rule r, as above, leads to three different defaults:

defs(r) = p1' ∧ ... ∧ pn' : / p'

new(p) = p' : / p

defd(r) = p1 ∧ ... ∧ pn : p / p

Hereby ' is an operator that generates new, pairwise distinct names. We have def(r) =
{defd(r), defs(r), new(p)}.

Finally we define

defd(r) >d defd(s) iff r > s.

There is no priority information regarding the defs(.) and new(.) defaults.
Example 2 (continued) We reconsider the defeasible theory D

=> p
p -> q
=> ¬q

The translation into default logic T(D) consists of the defaults:

true : p / p      p : q / q      p' : / q'      q' : / q      true : ¬q / ¬q

There are two extensions, Th({p, q}) and Th({p, ¬q}), so only p is sceptically provable
in T(D). This outcome corresponds to p being the only literal that is defeasibly
provable in D.
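
The translation itself is easy to state as a function from rules to defaults. The following Python sketch is a minimal illustration, with several assumptions that are not from the paper: rules are given as `label -> (is_strict, antecedent tuple, head)`, a default is a triple `(prerequisites, justifications, consequent)`, ¬q is written "~q", and the ' operator is imitated by appending an apostrophe to the literal's name. The helper names `prime` and `translate` are hypothetical.

```python
def prime(lit):
    """Stand-in for the ' operator: a fresh name derived from the literal."""
    return lit + "'"


def translate(rules, superior=frozenset()):
    """Build the defaults and the priority relation >d of T(D).

    rules: label -> (is_strict, antecedent tuple, head)
    superior: set of (stronger, weaker) rule labels
    Returns (defaults, priority), where defaults maps a name to
    (prerequisite literals, justification literals, consequent).
    """
    defaults = {}
    for label, (strict, body, head) in rules.items():
        # Every rule, strict or defeasible, yields the normal default defd(r).
        defaults[f"defd({label})"] = (tuple(body), (head,), head)
        if strict:
            # Strict rules additionally yield two justification-free defaults.
            defaults[f"defs({label})"] = (tuple(prime(a) for a in body), (), prime(head))
            defaults[f"new({head})"] = ((prime(head),), (), head)
    # Priorities are inherited only between the defd(.) defaults.
    priority = {(f"defd({r})", f"defd({s})") for r, s in superior}
    return defaults, priority


# Example 2: => p, p -> q, => ~q
example2 = {
    "r1": (False, (), "p"),
    "r2": (True, ("p",), "q"),
    "r3": (False, (), "~q"),
}
defaults, priority = translate(example2)
for name, d in defaults.items():
    print(name, d)
```

Because new(p) depends only on the head, two strict rules with the same head share a single new(p) entry in the dictionary, which matches the set-based reading def(r) = {defd(r), defs(r), new(p)}.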

Lemma 1.

(a) If D ⊢ +Δp then p' ∈ E for all extensions E of T(D).

(b) If D ⊢ −Δp then p' ∉ E, for all extensions E of T(D).

Proof: The proof goes by induction on the length of a derivation in defeasible logic.
We only show (a) here; (b) can be proven in the same way.

Consider a proof P, and suppose P(i + 1) = +Δq. Then, by definition, there is a
rule r ∈ R_s[q] such that for all a ∈ A(r), +Δa ∈ P(1..i).

By induction hypothesis, a' is included in all extensions E of T(D), for all a ∈
A(r). Consider an arbitrary E = In(Π≫) for a total order ≫ that includes >d and
generates a successful Π≫. Then a' is included in In(Π≫), for all a ∈ A(r). But then
the prerequisite of defs(r) is in In(Π≫), so defs(r) is applicable to In(Π≫), and
defs(r) occurs in Π≫ because Π≫ is closed. Thus the consequent of defs(r), q', is
included in In(Π≫). □

Lemma 2.

(a) If D ⊢ −∫p then p ∉ E, for all extensions E of T(D).

(b) If D ⊢ +∂p then p ∈ E for all extensions E of T(D).

Proof: We use simultaneous induction for (a) and (b) on the length of a derivation P.

(a): Let P(i + 1) = −∫p.

Suppose there is a total order ≫ on the defaults of T(D) which extends >d, generates
a successful process Π≫, and p ∈ In(Π≫). We will show that these assumptions
lead to a contradiction.

First we note that p ∈ In(Π≫) and that there are neither facts nor disjunctions in
T(D). Therefore there must be a default in Π≫ with consequent p.

One possibility is that a default new(p) occurs in Π≫. However, by the −∫ condition,
we have −Δp ∈ P(1..i). Then, by Lemma 1, p' ∉ E for all extensions E of
T(D). So p' ∉ In(Π≫). But then new(p) is not applicable to In(Π≫), so it cannot
occur in Π≫, which gives us a contradiction.

The other possibility is that defd(r) occurs in Π≫, for some rule r ∈ R[p]. Consider
the first such default that appears in Π≫ and suppose it occurs in the (k + 1)st position
(that is, defd(r) is applicable to In(Π≫[k])). Then defd(r) ≫ δ for all defaults δ that
are applicable to In(Π≫[k]) and not yet in Π≫[k]. That means, because ≫ extends
>d, that

defd(r) ≮d defd(s)    (*)

for all rules s ∈ R[~p] such that defd(s) is applicable to In(Π≫[k]).

However, from the condition −∫ and the assumption P(i + 1) = −∫p, we know
that one of two cases holds. In the first case there is a ∈ A(r) such that −∫a ∈ P(1..i).
Then, by induction hypothesis, a ∉ In(Π≫), so pre(defd(r)) ∉ In(Π≫), which gives a
contradiction to the assumption that defd(r) occurs in the process Π≫.

The other case of the −∫ condition is that there is a rule s ∈ R[~p] such that s > r
and +∂a ∈ P(1..i), for all a ∈ A(s). By definition, we have

defd(s) >d defd(r).

Moreover, by induction hypothesis (part (b)), a ∈ In(Π≫), for all a ∈ A(s). Thus

pre(defd(s)) ∈ In(Π≫).

By definition of a derivation in defeasible logic, the derivation of an antecedent
a cannot depend on p (otherwise the derivation of a and p would fail due to looping).
By construction of T(D), that means that pre(defd(s)) ∈ In(Π≫[k]). Moreover
In(Π≫[k]) is consistent with the justification of defd(s), namely ~p, because defd(r)
was assumed to be the first default with consequent p that occurs in Π≫.

Thus defd(s) has been established to be applicable to In(Π≫[k]), and defd(s) >d
defd(r). So either defd(s) already occurs in Π≫[k], which contradicts the applicability
of defd(r) (since cons(defd(s)) = ~just(defd(r))); or defd(s) does not occur in Π≫[k],
which contradicts (*).

Part (b) is proven in a similar way. □

It is worth noting that a similar result does not hold for +∫ and −∂. For example,
one might think that if +∫p is provable then p is included in at least one extension.
That is not the case.

Example 3. Consider the theory

=> p
=> ¬p

{p, ¬p} => q

Here q is supported because both p and ¬p are supported, but q is not included in any
extension of the corresponding default theory (the prerequisite p ∧ ¬p of the third default
cannot be proved).

The following theorem summarizes the main result. 

Theorem 1. If a literal p is defeasibly provable in D, then p is included in all extensions
of T(D).

The converse is not true, as the following example shows. 

Example 4. Consider the defeasible theory

=> p
=> ¬p

p => q
¬p => q

In defeasible logic, q is not defeasibly provable because neither p nor ¬p is defeasibly
provable. However, the default logic translation

true : p / p      true : ¬p / ¬p      p : q / q      ¬p : q / q

has two extensions, Th({p, q}) and Th({¬p, q}), so q is included in all extensions.

Example 5. Consider the defeasible theory

r1 : => p
r2 : => p
r3 : => ¬p
r4 : => ¬p

r1 > r3
r2 > r4

p is defeasibly provable. Now consider the translation into default logic.

defd(r1) = true : p / p        defd(r2) = true : p / p

defd(r3) = true : ¬p / ¬p      defd(r4) = true : ¬p / ¬p

defd(r1) >d defd(r3)        defd(r2) >d defd(r4)

Six total orders ≫ are possible which do not violate the relation >d:

defd(r1) ≫ defd(r3) ≫ defd(r2) ≫ defd(r4)
defd(r1) ≫ defd(r2) ≫ defd(r3) ≫ defd(r4)
defd(r1) ≫ defd(r2) ≫ defd(r4) ≫ defd(r3)
defd(r2) ≫ defd(r4) ≫ defd(r1) ≫ defd(r3)
defd(r2) ≫ defd(r1) ≫ defd(r4) ≫ defd(r3)
defd(r2) ≫ defd(r1) ≫ defd(r3) ≫ defd(r4)

It is easy to see that each such arrangement leads to the extension Th({p}).
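
The count of six and the resulting extension can be checked mechanically. The sketch below is an illustration under simplifying assumptions (literal-only normal defaults, ¬p written "~p"); the helper names `respects` and `greedy_process` are hypothetical. It enumerates all permutations of the four defaults, keeps those compatible with >d, and for each builds the process that always picks the largest applicable default not yet used.

```python
from itertools import permutations


def neg(lit):
    """Complement of a literal; "~p" encodes the negation of p."""
    return lit[1:] if lit.startswith("~") else "~" + lit


# Normal defaults of Example 5: name -> (prerequisites, justification, consequent).
DEFAULTS = {
    "defd(r1)": ((), "p", "p"),
    "defd(r2)": ((), "p", "p"),
    "defd(r3)": ((), "~p", "~p"),
    "defd(r4)": ((), "~p", "~p"),
}
PRIORITY = {("defd(r1)", "defd(r3)"), ("defd(r2)", "defd(r4)")}  # (stronger, weaker)


def respects(order, priority):
    """True iff the total order (largest first) extends the priority relation."""
    pos = {name: i for i, name in enumerate(order)}
    return all(pos[hi] < pos[lo] for hi, lo in priority)


def applicable(default, E):
    pre, just, _ = default
    return all(a in E for a in pre) and neg(just) not in E


def greedy_process(order, defaults):
    """In(process) when the largest applicable unused default is applied at each step."""
    E, used = set(), set()
    while True:
        nxt = next((n for n in order
                    if n not in used and applicable(defaults[n], E)), None)
        if nxt is None:
            return frozenset(E)
        used.add(nxt)
        E.add(defaults[nxt][2])


orders = [o for o in permutations(DEFAULTS) if respects(o, PRIORITY)]
results = {greedy_process(o, DEFAULTS) for o in orders}
print(len(orders), results)
```

Since all defaults here are normal, every such process is successful, so no separate success check is needed in this particular example.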

Example 1 (continued)

The translation into default logic consists of the defaults

true : p / p      true : ¬p / ¬p      true : q / q      p : ¬q / ¬q

There are three extensions, Th({p, q}), Th({p, ¬q}) and Th({¬p, q}). Thus none of
p, ¬p, q, ¬q is included in all extensions. This outcome is consistent with our previous
result that the original defeasible theory does not have any positive defeasible conclusion.
This example demonstrates the ambiguity propagating nature of default logic, and
justifies our selection of an ambiguity propagating defeasible logic to conduct the
comparison.



5 Conclusions 

This paper established for the first time a relationship between default logic and a
defeasible logic. In particular, it showed how an ambiguity propagating defeasible logic
can be embedded into default logic. Based on our results defeasible logic can be viewed
as an efficient approximation of classes of default theories.



Abstract. We consider the well-studied Pattern Recognition (PR) problem
of designing linear classifiers. When dealing with normally distributed
classes, it is well known that the optimal Bayes classifier is linear
only when the covariance matrices are equal. This was the only known
condition for discriminant linearity. In a previous work, we presented
the theoretical framework for optimal pairwise linear classifiers for
two-dimensional normally distributed random vectors. We derived the
necessary and sufficient conditions that the distributions have to satisfy so
as to yield the optimal linear classifier as a pair of straight lines.

In this paper we extend the previous work to d-dimensional normally
distributed random vectors. We provide the necessary and sufficient
conditions needed so that the optimal Bayes classifier is a pair of
hyperplanes. Various scenarios have been considered including one which
resolves the multi-dimensional Minsky's paradox for the perceptron. We
have also provided some three-dimensional examples for all the cases,
and tested the classification accuracy of the relevant pairwise linear
classifier that we found. In all the cases, these linear classifiers achieve very
good performance.



1 Introduction 

The problem of finding linear classifiers has been the study of many researchers
in the field of Pattern Recognition (PR). Linear classifiers are very important
because of their simplicity when it concerns implementation, and their
classification speed. Various schemes to yield linear classifiers are reported in the
literature such as Fisher’s approach [2,7,18], the perceptron algorithm (the basis of
the back propagation neural network learning algorithms) [6,8,11,12], piecewise
recognition models [9], random search optimization [10], and removal classification
structures [1]. All of these approaches suffer from the lack of optimality, and
thus, although they do determine linear discriminant functions, the classifier is
not optimal.

M. Brooks, D. Corbett, and M. Stumptner (Eds.): AI 2001, LNAI 2256, pp. 25-36, 2001.

© Springer-Verlag Berlin Heidelberg 2001

Apart from the results reported in [16,17], in statistical PR, the Bayesian
linear classification for normally distributed classes involves a single case. This 
traditional case is when the covariance matrices are equal [4,13]. In this case, 
the classifier is a single straight line (or a hyperplane in the d-dimensional case) 
completely specified by a first-order equation. 

In [16, 17], we showed that although the general classifier for two dimensional 
normally distributed random vectors is a second-degree polynomial, this polyno- 
mial degenerates to be either a single straight line or a pair of straight lines. Thus, 
as opposed to the traditional results, we showed that the classifier can be linear 
even when the covariance matrices are not equal. In this case, the discriminant 
function is a pair of first-order equations, which are factors of the second-order 
polynomial (i.e. the discriminant function). When the factors are equal, the dis- 
criminant function is given by a single straight line, which corresponds to the 
traditional case when the covariance matrices are equal. 

In this paper, we extend these conditions for d-dimensional normal random
vectors, where d > 2. We assume that the features of an object to be recognized
are represented as a d-dimensional vector which is an ordered tuple
$X = [x_1 \ldots x_d]^T$ characterized by a probability distribution function. We deal
only with the case in which these random vectors have a jointly normal distribution,
where class $\omega_i$ has a mean $M_i$ and covariance matrix $\Sigma_i$, $i = 1, 2$.

Without loss of generality, we assume that the classes $\omega_1$ and $\omega_2$ have the
same a priori probability, 0.5, in which case, the discriminant function is given
by:

$$\log\frac{|\Sigma_2|}{|\Sigma_1|} - (X - M_1)^T \Sigma_1^{-1} (X - M_1) + (X - M_2)^T \Sigma_2^{-1} (X - M_2) = 0 . \qquad (1)$$

When $\Sigma_1 = \Sigma_2$, the discriminant function is linear [3, 19]. For the case when
$\Sigma_1$ and $\Sigma_2$ are arbitrary, the classifier is a general equation of second
degree, so that the discriminant is a hyperparaboloid, a hyperellipsoid,
a hypersphere, a hyperboloid, or a pair of hyperplanes. This latter case is
the focus of our present study.
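
As a concrete reading of (1), the following Python sketch evaluates the left-hand side for the special case of diagonal covariance matrices, which is the setting obtained after simultaneous diagonalization. It is a minimal illustration, not the authors' implementation: the function names `discriminant` and `classify` are assumptions, and each covariance matrix is passed as the list of its diagonal entries.

```python
import math


def discriminant(x, m1, var1, m2, var2):
    """Left-hand side of (1) for diagonal covariance matrices.

    x, m1, m2: sequences of equal length (sample and the two class means).
    var1, var2: diagonal entries of Sigma_1 and Sigma_2 (all positive).
    """
    # log(|Sigma_2| / |Sigma_1|); the determinant of a diagonal matrix is the
    # product of its diagonal entries.
    log_det_ratio = sum(math.log(v2 / v1) for v1, v2 in zip(var1, var2))
    # Squared Mahalanobis distances to each class mean.
    q1 = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, m1, var1))
    q2 = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, m2, var2))
    return log_det_ratio - q1 + q2


def classify(x, m1, var1, m2, var2):
    """Class 1 if the discriminant is positive, class 2 otherwise (equal priors)."""
    return 1 if discriminant(x, m1, var1, m2, var2) > 0 else 2


# Equal covariances: the discriminant reduces to the linear function 4x.
value = discriminant((0.5, 3.0), (1.0, 0.0), (1.0, 1.0), (-1.0, 0.0), (1.0, 1.0))
print(value, classify((0.5, 3.0), (1.0, 0.0), (1.0, 1.0), (-1.0, 0.0), (1.0, 1.0)))
```

A positive value means the sample is more likely under the first class, because (1) is twice the difference of the two log-likelihoods when the priors are equal.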

The results presented here have been rigorously tested. In particular, we 
present some empirical results for the cases in which the optimal Bayes classi- 
fier is a pair of hyperplanes. It is worth mentioning that we tested the case of 
Minsky’s paradox on randomly generated samples, and we have found that the 
accuracy is very high even though the classes are significantly overlapping. 

The formal proofs of a few theorems are omitted in the interest of brevity.
They are found in the unabridged version of the paper [14] and in [15], and can
be made available to the reader.

2 Linear Discriminants for Diagonalized Classes: The 2-D Case

The concept of diagonalization is quite fundamental to our study. Diagonalization
is the process of transforming a space by performing linear and whitening
transformations [3]. Consider a normally distributed random vector, $X$, with any
mean vector and covariance matrix. By performing diagonalization, $X$ can be
transformed into another normally distributed random vector, $Z$, whose covariance
is the identity matrix. This can be easily generalized to incorporate what is
called “simultaneous diagonalization”. By performing this process, two normally
distributed random vectors, $X_1$ and $X_2$, can be transformed into two other
normally distributed random vectors, $Z_1$ and $Z_2$, whose covariance matrices are
the identity and a diagonal matrix, respectively. A more in-depth discussion of
diagonalization can be found in [3, 18], and is omitted here as it is assumed to
be fairly elementary. We discuss below the conditions for the mean vectors and
covariance matrices of simultaneously diagonalized vectors in which the Bayes
optimal classifier is pairwise linear.

In [16,17], we presented the necessary and sufficient conditions required so 
that the optimal classifier is a pair of straight lines, for the two dimensional 
space. Using these results, we present here the cases for the d-dimensional case 
in which the optimal Bayes classifier is a pair of hyperplanes. 

Since we repeatedly refer to the work of [16, 17], we state (without proof) the 
relevant results below. 

One of the cases in which we evaluated the possibility of finding a pair of 
straight lines as the optimal classifier is when we have inequality constraints. 
This case is discussed below. 

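Theorem 1. Let $X_1 \sim N(M_1, \Sigma_1)$ and $X_2 \sim N(M_2, \Sigma_2)$ be two normally
distributed random vectors with parameters of the form:

$$M_1 = \begin{bmatrix} r \\ s \end{bmatrix}, \quad M_2 = \begin{bmatrix} -r \\ -s \end{bmatrix}, \quad \Sigma_1 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad \text{and} \quad \Sigma_2 = \begin{bmatrix} a^{-1} & 0 \\ 0 & b^{-1} \end{bmatrix} . \qquad (2)$$

There exist real numbers, $r$ and $s$, such that the optimal Bayes classifier is a
pair of straight lines if one of the following conditions is satisfied:

(a) $0 < a < 1$ and $b > 1$,

(b) $a > 1$ and $0 < b < 1$.

Moreover, if

$$a(1 - b)r^2 + b(1 - a)s^2 - \tfrac{1}{4}(ab - a - b + 1)\log ab = 0 , \qquad (3)$$

the optimal Bayes classifier is a pair of straight lines. □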
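
Condition (3) can be checked numerically. With $\Sigma_1 = I$ and $\Sigma_2 = \mathrm{diag}(a^{-1}, b^{-1})$, the discriminant is $-\log ab - (x-r)^2 - (y-s)^2 + a(x+r)^2 + b(y+s)^2$, and completing the squares shows that its constant term vanishes exactly when (3) holds. The Python sketch below is an illustration for case (a) only ($0 < a < 1 < b$); the function names `solve_s`, `g` and `line_factors` are assumptions of this sketch. It solves (3) for $s$ and verifies that the quadratic equals the product of two linear forms.

```python
import math


def solve_s(a, b, r):
    """Solve condition (3) for s >= 0, given a, b and r."""
    s2 = (0.25 * (a - 1) * (b - 1) * math.log(a * b)
          - a * (1 - b) * r * r) / (b * (1 - a))
    if s2 < 0:
        raise ValueError("no real s satisfies (3) for these a, b, r")
    return math.sqrt(s2)


def g(x, y, a, b, r, s):
    """Discriminant for M1 = (r, s), M2 = (-r, -s), Sigma_1 = I, Sigma_2 = diag(1/a, 1/b)."""
    return (-math.log(a * b) - (x - r) ** 2 - (y - s) ** 2
            + a * (x + r) ** 2 + b * (y + s) ** 2)


def line_factors(a, b, r, s):
    """Two lines (cx, cy, c0), meaning cx*x + cy*y + c0 = 0, for case (a): 0 < a < 1 < b.

    When (3) holds, g = (b-1)(y-y0)^2 - (1-a)(x-x0)^2, a difference of squares.
    """
    x0 = -r * (a + 1) / (a - 1)
    y0 = -s * (b + 1) / (b - 1)
    u, v = math.sqrt(1 - a), math.sqrt(b - 1)
    line1 = (-u, v, u * x0 - v * y0)
    line2 = (u, v, -u * x0 - v * y0)
    return line1, line2


a, b, r = 0.5, 2.0, 1.0
s = solve_s(a, b, r)
l1, l2 = line_factors(a, b, r, s)
for x, y in [(0.0, 0.0), (1.0, 1.0), (-2.0, 3.5)]:
    product = (l1[0] * x + l1[1] * y + l1[2]) * (l2[0] * x + l2[1] * y + l2[2])
    print(round(g(x, y, a, b, r, s), 9), round(product, 9))
```

For $a = 0.5$, $b = 2$ the logarithm vanishes and (3) reduces to $s^2 = r^2/2$, which makes the example easy to verify by hand.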

Another case evaluated in [16, 17] is when we have equality constraints. In this
case, the optimal Bayes classifier is a pair of parallel straight lines. In particular,
when $\Sigma_1 = \Sigma_2$, these lines are coincident.

Theorem 2. Let $X_1 \sim N(M_1, \Sigma_1)$ and $X_2 \sim N(M_2, \Sigma_2)$ be two normally
distributed random vectors with parameters of the form of (2). The optimal Bayes
classifier is a pair of straight lines if one of the following conditions is satisfied:

(a) $a = 1$, $b \neq 1$, and $r = 0$,

(b) $a \neq 1$, $b = 1$, and $s = 0$.

The classifier is a single straight line if:

(c) $a = 1$ and $b = 1$. □

3 Multi-dimensional Pairwise Hyperplane Discriminants

Let us consider now the more general case for d > 2. Using the results mentioned
above, we derive the necessary and sufficient conditions for a pairwise-linear
optimal Bayes classifier. From the inequality constraints (a) and (b) of Theorem
1, we state and prove that it is not possible to find the optimal Bayes classifier as
a pair of hyperplanes for these conditions when d > 2. We modify the notation
marginally. We use the symbols $(a_1^{-1}, a_2^{-1}, \ldots, a_d^{-1})$ to synonymously refer to the
marginal variances $(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2)$.

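Theorem 3. Let $X_1 \sim N(M_1, \Sigma_1)$ and $X_2 \sim N(M_2, \Sigma_2)$ be two normally
distributed random vectors, such that

$$M_1 = -M_2 = \begin{bmatrix} m_1 \\ m_2 \\ \vdots \\ m_d \end{bmatrix}, \quad \Sigma_1 = I, \quad \text{and} \quad \Sigma_2 = \begin{bmatrix} a_1^{-1} & 0 & \cdots & 0 \\ 0 & a_2^{-1} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_d^{-1} \end{bmatrix}, \qquad (4)$$

where $a_i \neq 1$, $i = 1, \ldots, d$. There are no real numbers $m_i$, $i = 1, \ldots, d$, such
that the optimal Bayes classifier is a pair of hyperplanes.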

The proof of Theorem 3 can be found in the unabridged version of this paper, 
[14], and in [15]. This proof is achieved by checking if there is an optimal pairwise 
linear classifier for all the pairs of axes. This is not possible since, if the condition 
has to be satisfied when the first element on the diagonal is less than unity, the 
second one must be greater than unity. Consequently, there is no chance for a 
third element to satisfy this condition in a pairwise manner, in conjunction with 
the first two elements. □ 

Using the results of Theorem 2, we now analyze the possibility of finding the
optimal pairwise linear classifiers for the d-dimensional case when some of the
entries in $\Sigma_2$ are unity.

Theorem 4. Let $X_1 \sim N(M_1, \Sigma_1)$ and $X_2 \sim N(M_2, \Sigma_2)$ be two normally
distributed random vectors with parameters of the form of (4). If there exists $i$ such
that $a_i \neq 1$, and $a_j = 1$, $m_j = 0$, for $j = 1, \ldots, d$, $i \neq j$, then the optimal Bayes
classifier is a pair of hyperplanes.

The proof of Theorem 4 can be found in [14, 15]. □

We now combine the results of Theorems 1 and 2, and state more general
necessary and sufficient conditions to find a pair of hyperplanes as the optimal
Bayes classifier. We achieve this using the inequality and equality constraints of
these theorems.

The main difference between Theorem 4 and the theorem given below is that
in the former, all the elements but one of the diagonal of $\Sigma_2$ are equal to unity. In
the theorem presented below, there are two elements of the diagonal of $\Sigma_2$ which
are not equal to unity, and therefore they must satisfy (3) and either condition
(a) or (b) of Theorem 1.

Theorem 5. Let $X_1 \sim N(M_1, \Sigma_1)$ and $X_2 \sim N(M_2, \Sigma_2)$ be two normally
distributed random vectors with parameters of the form of (4). The optimal Bayes
classifier is a pair of hyperplanes if there exist $i$ and $j$ such that any of the
following conditions are satisfied:

(a) $0 < a_i < 1$, $a_j > 1$, $a_k = 1$, $m_k = 0$, for all $k = 1, \ldots, d$, $k \neq i$,
$k \neq j$, with

$$a_i(1 - a_j)m_i^2 + a_j(1 - a_i)m_j^2 - \tfrac{1}{4}(a_i a_j - a_i - a_j + 1)\log a_i a_j = 0 . \qquad (5)$$

(b) $a_i \neq 1$, $a_j = 1$, $m_j = 0$, for all $j \neq i$.

(c) $a_i = 1$, for all $i = 1, \ldots, d$.

The proof of Theorem 5 can be found in [14, 15]. □



Note that the final case considered in condition (c) corresponds to the tra- 
ditional case in which the optimal Bayes classifier is a single hyperplane when 
both the covariance matrices are identical. 



4 Linear Discriminants with Different Means 

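In [16], we have shown that given two normally distributed random vectors, $X_1$
and $X_2$, with mean vectors and covariance matrices of the form:

$$M_1 = \begin{bmatrix} r \\ s \end{bmatrix}, \quad
M_2 = \begin{bmatrix} -r \\ -s \end{bmatrix}, \quad
\Sigma_1 = \begin{bmatrix} a^{-1} & 0 \\ 0 & b^{-1} \end{bmatrix}, \quad \text{and} \quad
\Sigma_2 = \begin{bmatrix} b^{-1} & 0 \\ 0 & a^{-1} \end{bmatrix}, \qquad (6)$$

the optimal Bayes classifier is a pair of straight lines when $r^2 = s^2$, where
$a$ and $b$ are any positive real numbers. The discriminant function for this case is
given by:

$$a(x - r)^2 + b(y - s)^2 - b(x + r)^2 - a(y + s)^2 = 0. \qquad (7)$$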
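As a quick numerical sanity check of (7), the sketch below evaluates the left-hand side of the discriminant and confirms that, in the special case $r = s$ with $a \neq b$, it vanishes on two straight lines. The factorization used, $(x + y)\,[(a - b)(x - y) - 2(a + b)r] = 0$, is obtained here by expanding (7) and is not quoted from [16]; the function names and the parameter values are illustrative assumptions.

```python
def discriminant(x, y, a, b, r, s):
    """Left-hand side of (7) for the class parameters in (6)."""
    return (a * (x - r) ** 2 + b * (y - s) ** 2
            - b * (x + r) ** 2 - a * (y + s) ** 2)


def line_one(t):
    """Point on the first line, x + y = 0 (assumes r == s)."""
    return t, -t


def line_two(t, a, b, r):
    """Point on the second line, (a - b)(x - y) = 2(a + b)r (assumes r == s, a != b)."""
    offset = 2.0 * (a + b) * r / (a - b)
    return t, t - offset


# Hypothetical parameter values, chosen only for illustration.
a, b, r = 2.0, 1.0, 1.0
for t in (-3.0, 0.0, 0.5, 4.0):
    x1, y1 = line_one(t)
    x2, y2 = line_two(t, a, b, r)
    # Both points lie on the decision boundary, so (7) evaluates to zero.
    assert abs(discriminant(x1, y1, a, b, r, r)) < 1e-9
    assert abs(discriminant(x2, y2, a, b, r, r)) < 1e-9
```

A point off both lines, such as $(1, 1)$ with these parameters, gives a non-zero value, so the boundary consists of the two lines only.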

We consider now the more general case for d > 2. We are interested in 
finding the conditions that guarantee a pairwise linear discriminant function. 
This is given in Theorem 6 below. 




30 L. Rueda and B..T. Oommen 



Theorem 6. Let $X_1 \sim N(M_1, \Sigma_1)$ and $X_2 \sim N(M_2, \Sigma_2)$ be two normal random
vectors such that

The optimal classifier, obtained by Bayes classification, is a pair of
hyperplanes when

$$m_i^2 = m_j^2. \qquad (10)$$

The proof of Theorem 6 is quite involved and can be found in [14, 15]. It is 
omitted here for the sake of brevity. □ 

Theorem 6 can be interpreted geometrically as follows. Whenever we have 
two covariance matrices that differ only in two elements of their diagonal; and 
whenever the two elements in the second covariance matrix are a permutation 
of the same rows in the first matrix, if the mean vectors differ only in these two 
elements, the resulting discriminant function is a pair of hyperplanes. 

Indeed, by performing a projection of the space in the $x_i$ and $x_j$ axes, we
observe that the discriminant takes on exactly the same shape as that which is
obtained from the distribution given in (6). Thus, effectively, we obtain a pair
of straight lines in the two dimensional space from the projection of the pair of
hyperplanes in the d-dimensional space.

5 Linear Discriminants with Equal Means 

We consider now a particular instance of the problem discussed in Section 4, 
which leads to the resolution of the generalization of the d-dimensional Minsky’s 
paradox. In this case, the covariance matrices have the form of (9), but the 
mean vectors are the same for both classes. We shall show now that, with these 
parameters, it is always possible to find a pair of hyperplanes, which resolves 
Minsky’s paradox in the most general case. 

Theorem 7. Let $X_1 \sim N(M_1, \Sigma_1)$ and $X_2 \sim N(M_2, \Sigma_2)$ be two normal random
vectors, where $M_1 = M_2 = [m_1, \ldots, m_d]^T$ and $\Sigma_1$ and $\Sigma_2$ have the form of (9).
The optimal classifier, obtained by Bayes classification, is a pair of hyperplanes.

Proof. From Theorem 6, we know that when the mean vectors and covariance
matrices have the form of (8) and (9), respectively, then the optimal Bayes
classifier is a pair of hyperplanes if mj = mj. Since Mi = M 2 , then m| = mj, 
for j = 1,. . . ,d,i j. Hence the optimal Bayes classifier is a pair of hyperplanes. 
The theorem is thus proved. □ 



6 Simulation Results 

In order to test the accuracy of the pairwise linear discriminants and to verify 
the results derived here, we have performed some simulations for the different 
cases discussed above. We have chosen the dimension d = 3, since it is easy 
to visualize and plot the corresponding hyperplanes. In all the simulations, we 
trained our classifier using 100 randomly generated training samples (which were 
three dimensional vectors from the corresponding classes). Using the maximum 
likelihood estimation method [18], we then approximated the mean vectors and 
covariance matrices for each of the three cases. 
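The estimation step described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the class parameters, the seed, and the helper names are assumptions, and the samples are drawn with a diagonal covariance matrix for simplicity.

```python
import random


def mle_gaussian(samples):
    """Return the ML mean vector and covariance matrix (normalized by n, not n - 1)."""
    n = len(samples)
    d = len(samples[0])
    mean = [sum(x[k] for x in samples) / n for k in range(d)]
    cov = [[sum((x[j] - mean[j]) * (x[k] - mean[k]) for x in samples) / n
            for k in range(d)] for j in range(d)]
    return mean, cov


def sample_diagonal_gaussian(mean, variances, n, rng):
    """Draw n points from a normal distribution with a diagonal covariance matrix."""
    return [[rng.gauss(m, v ** 0.5) for m, v in zip(mean, variances)]
            for _ in range(n)]


# Hypothetical class parameters; the values used in the paper are not reproduced here.
rng = random.Random(0)
training = sample_diagonal_gaussian([1.0, 2.0, 0.0], [0.5, 2.0, 1.0], 100, rng)
mean_hat, cov_hat = mle_gaussian(training)
```

With only 100 training samples the estimates differ somewhat from the generating parameters, which is one reason the estimated classifier need not coincide exactly with the theoretical one.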

We considered two classes, $\omega_1$ and $\omega_2$, which are represented by two normal
random vectors, $X_1 \sim N(M_1, \Sigma_1)$ and $X_2 \sim N(M_2, \Sigma_2)$, respectively. For each
class, we used two sets of 100 normal random points to test the accuracy of the
classifiers.

In all the cases, to display the distribution, we plotted the ellipsoid of equi- 
probable points instead of the training points. This was because the plot of the 
three dimensional points caused too much cluttering, making the shape of the 
classes and the discriminants indistinguishable. 



6.1 Linear Discriminants for Two Diagonalized Classes 

In the first test, DD-1, we considered the pairwise linear discriminant function for 
two diagonalized classes. These classes are normally distributed with covariance 
matrices being the identity matrix and another matrix in which two elements 
of the diagonal are not equal to unity and the remaining are unity. This is
indeed the case in which the optimal Bayes classifier is shown to be a pair
of hyperplanes, stated and proven in Theorem 5. The following mean vectors
and covariance matrices were estimated from 100 training samples to yield the
respective classifier:

The plot of the ellipsoid simulating the points and the linear discriminant
hyperplanes in the three dimensional space is depicted in Fig. 1. The accuracy
of the classifier was 96% for $\omega_1$ and 97% for $\omega_2$. The power of the scheme is
obvious!

Fig. 1. Example of pairwise linear discriminant for diagonalized normally distributed
classes. This example corresponds to the data set DD-1.



6.2 Pairwise Linear Discriminant with Different Means 

To demonstrate the properties of the classifier satisfying the conditions of
Theorem 6, we considered the pairwise linear discriminant with different means. In
this case, the diagonal covariance matrices differ only in two elements. These
two elements in the first matrix have switched positions in the second covariance
matrix. The remaining elements are identical in both covariance matrices. The
mean vectors and covariance matrices estimated from 100 training samples are
given below.

Using these parameters, the pairwise linear classifier was derived. The plot of
the ellipsoid simulating the points and the linear discriminant hyperplanes is
shown in Fig. 2. With this classifier, we obtained an accuracy of 94% for $\omega_1$ and
97% for $\omega_2$.



6.3 Pairwise Linear Discriminant with Equal Means 

Fig. 2. Example of pairwise linear discriminant with different means for the case
described in Section 4. These classes correspond to the data set DM-1.

We also tested our scheme for the case of the pairwise linear classifier with equal
means, EM-1, for the generalized multi-dimensional Minsky’s Paradox. This is
the case in which we have coincident mean vectors, but covariance matrices as
in the case of DM-1. Two classes having parameters like these are proven
in Theorem 7 to be optimally classified by a pair of hyperplanes. We obtained
the following estimated mean vectors and covariance matrices from 100 training
samples:

The shape of the overlapping classes and the linear discriminant function
from these estimates are given in Fig. 3. We evaluated the classifier with 100
randomly generated test points, and the accuracy was 82% for $\omega_1$ and 85% for
$\omega_2$. Observe that such a linear classifier is not possible using any of the reported
traditional methods.



6.4 Analysis of Accuracy 

Finally, we analyze the accuracy of the classifiers for the different cases discussed
above. The accuracy of classification for the three cases is given in Table 1. The
first column corresponds to the test case. The second and third columns represent
the percentage of correctly classified points belonging to $\omega_1$ and $\omega_2$, respectively.
Observe that the accuracy of DD-1 is very high. This case corresponds to the
pairwise linear discriminant when dealing with covariance matrices being the
identity and another diagonal matrix in which two elements are not equal to
unity, as shown in Theorem 5. The accuracy of the case in which the means
are different and the covariance matrices as given in (8) and (9) (third row) is
still very high. The fourth row corresponds to the case where the means are
identical, referred to as EM-1. The accuracy is lower than that of the other cases
but still very high, even though the classes overlap and the discriminant function
is pairwise linear. This demonstrates the power of our scheme to resolve Minsky’s
Paradox in three dimensions!

Fig. 3. Example of pairwise linear discriminant with equal means for the case described
in Section 5. The data set is EM-1. This resolves the generalized multi-dimensional
Minsky’s paradox.



Table 1. Accuracy of classification of 100 three dimensional random test points
generated with the parameters of the examples presented above. The accuracy is given in
percentage of points correctly classified.

Example    Accuracy for $\omega_1$    Accuracy for $\omega_2$
DD-1       96 %                       97 %
DM-1       94 %                       97 %
EM-1       83 %                       88 %



7 Conclusions

In this paper we have extended the theoretical framework of obtaining optimal 
pairwise linear classifiers for normally distributed classes. We have shown that 
it is still possible to find the optimal classifier as a pair of hyperplanes for more 
than two dimensions. 

We have determined the necessary and sufficient conditions for an optimal 
pairwise linear classifier when the covariance matrices are the identity and a 
diagonal matrix. In this case, we have formally shown that it is possible to 
find the optimal linear classifier by satisfying certain conditions specified in the 
planar projections of the various components. 

In the second case, we have dealt with normally distributed classes having 
different mean vectors and with some special forms of covariance matrices. When 
the covariance matrices differ only in two elements of the diagonal, and these 
elements are inverted in positions in the second covariance matrix, it has been 
shown that the optimal classifier is a pair of hyperplanes only if the mean vectors 
differ in the two elements of these positions. The conditions for this have been 
formalized too. 

The last case that we have considered is the generalized Minsky’s paradox for
multi-dimensional normally distributed random vectors. By a formal procedure,
we have found that when the classes are overlapping and the mean vectors are
coincident, under certain conditions on the covariance matrices, the optimal
classifier is a pair of hyperplanes. This resolves the multi-dimensional Minsky’s
paradox.

We have also provided some examples for each of the cases discussed above, 
and we have tested our classifier on some three dimensional normally distributed 
features. The classification accuracy obtained is very high, which is reasonable 
as the classifier is optimal in the Bayesian context. The degree of accuracy for 
the third case is not as high as that of the other cases, but is still impressive
given the fact that we are dealing with significantly overlapping classes.



Abstract. Representing and reasoning with temporal information is an
essential part of many tasks in AI such as scheduling, planning and nat- 
ural language processing. Two influential frameworks for representing 
temporal information are interval algebra and point algebra [1, 8]. Given 
a knowledge-base consisting of temporal relations, the main reasoning 
problem is to determine whether this knowledge-base is satisfiable, i.e., 
there is a scenario which is consistent with the information provided. 
However, when a given set of temporal relations is unsatisfiable, no 
further reasoning is performed. We argue that many real world prob- 
lems are inherently overconstrained, and that these problems must also 
be addressed. This paper investigates approaches for handling overcon- 
strainedness in temporal reasoning. We adapt a well studied notion of 
partial satisfaction to define partial scenarios or optimal partial solu- 
tions. We propose two reasoning procedures for computing an optimal 
partial solution to a problem (or a complete solution if it exists). 



1 Introduction 

Temporal reasoning is a vital task in many areas such as planning [2], scheduling
[5] and natural language processing [6]. To date, the main focus of research
has been on how to represent temporal information and how to obtain a complete
solution to a problem. How the information is represented depends on the
type of temporal reasoning that is needed.

There are two ways in which we can reason about a temporal problem. The
reasoning method chosen depends on the information available. If a problem is
presented with only qualitative information (i.e. information about how events
are ordered with other events), Qualitative Temporal Reasoning is performed.
From the sentence “Fred drank his coffee while he ate his breakfast” we can
only gather information about the relative timing of the two events. On the
other hand, information can be presented as quantitative information, that is,
information about when certain events can or do happen. For example, Fred ate
his breakfast at 7:35am and drank his coffee at 7:40am. In this paper we deal
only with qualitative information.

M. Brooks, D. Corbett, and M. Stumptner (Eds.): AI 2001, LNAI 2256, pp. 37-49, 2001. 

© Springer- Verlag Berlin Heidelberg 2001 




38 M. Beaumont et al. 



Current research has been aimed at finding a complete solution or determin- 
ing that a problem has a solution [1,8,4]. If the problem is not solvable then 
only an error is provided. However in many situations simply determining that 
the problem has no solution is not enough. What is needed is a partial solution, 
where some of the constraints or variables have been weakened or removed to 
allow a solution to be found. 

While there has been no research on finding a partial solution to an overcon- 
strained temporal reasoning problem, there has been research on finding partial 
solutions to overconstrained constraint satisfaction problems (OCSP). One such 
approach is Partial Constraint Satisfaction [3]. Partial Constraint Satisfaction 
takes an overconstrained problem and obtains a partial solution by selectively 
choosing variables or constraints to either remove or weaken. This is done in 
such a way as to minimize the total number of variables or constraints that are 
removed or weakened and leads to an optimal partial solution. 

In this paper we define two methods for finding a solution to an overcon- 
strained Temporal Reasoning problem. The first method uses a standard brute 
force approach with forward checking/pruning capabilities. The second method 
also uses a brute force strategy but replaces forward checking/pruning with a 
cost function that can revise previous decisions at each step of the search. Both 
methods provide the ability to find an optimal partial solution or a complete 
solution if one exists. 

In sections 2 and 3 we give the relevant background information for both 
Temporal Reasoning and Partial Constraint Satisfaction. Section 4 introduces 
both methods and explains in detail how they work. We also present some pre- 
liminary experimental results in Section 5. 

2 Interval and Point Algebra 

The way in which qualitative temporal information is represented plays a key 
role in efficiently finding a solution to a problem or determining that no solution 
exists. Two representation schemes are Allen’s Interval Algebra [1] and Vilain 
and Kautz’s Point Algebra [8]. 

Interval algebra (IA) represents events as intervals in time. Each interval has
a start and an end point represented as an ordered pair $(S, E)$ where $S < E$. The
relation between two fixed intervals can consist of one of the 13 atomic interval
relations. The set of these relations is represented by $I$ and is shown in Table 1.

Representing indefinite information about relations between non-fixed intervals
can be achieved by allowing relations to be disjunctions of any of the atomic
relations from the set $I$. By allowing disjunctions of the 13 atomic relations we can
construct the set $A$ containing all $2^{13}$ possible binary relations, including the
empty relation $\emptyset$ and the no information relation $I$. The relation $I$ is known as
the no information relation because it contains all of the atomic relations; this
implies that nothing is known about the relationship between two events that
have this relation. To complete the algebra Allen also defined 4 interval
operations over the set $A$: intersection, union, inverse and composition. The operations
and their definitions are shown in Table 2.

Table 1. The set $I$ of all 13 atomic relations.

Table 2. The 4 interval operations and their definitions.

Operation      Symbol      Formal Definition
Intersection   $\cap$      $\forall x, y:\ x(A_1 \cap A_2)y$ iff $xA_1y \wedge xA_2y$
Union          $\cup$      $\forall x, y:\ x(A_1 \cup A_2)y$ iff $xA_1y \vee xA_2y$
Inverse        $\smile$    $\forall x, y:\ xA^{\smile}y$ iff $yAx$
Composition    $\circ$     $\forall x, y:\ x(A_1 \circ A_2)y$ iff $\exists z:\ xA_1z \wedge zA_2y$



A temporal problem expressed with IA can also be represented as a temporal
constraint graph [4]. In a temporal constraint graph nodes represent intervals
and the arcs between nodes are labeled with interval relations. Such a graph can
be re-expressed as a matrix $M$ of size $n \times n$ where $n$ is the number of intervals
in the problem. Every element of the matrix contains an interval relation from
the set $A$ of all possible interval relations with two restrictions: for the elements
$M_{ii}$ the interval relation is always $=$, and $M_{ji} = M_{ij}^{\smile}$.

One of the key reasoning tasks in IA is to determine if a problem is satisfiable.
A problem is satisfiable if we can assign a value to each interval’s start and
end point such that all the interval relations are satisfied. Satisfiability can be
determined by the use of the path-consistency method [1]. The method simply
computes the following for all $a$, $b$, $c$ of the matrix $M$:

$$M_{ac} = M_{ac} \cap (M_{ab} \circ M_{bc})$$

until there is no change in $M_{ac}$. A matrix $M$ is said to be path-consistent when
no elements are the empty set $\emptyset$ and there is no further change possible in $M$.
However, as shown by Allen [1], path-consistency does not imply satisfiability
for interval algebra. In fact determining satisfiability for IA is NP-Hard [8]
and a backtracking algorithm is required with path consistency to determine
satisfiability.

Point Algebra (PA) differs from IA in that events in time are represented
as points instead of intervals. By representing events as points the relations
between events are reduced to three possibilities $\{<, =, >\}$. The set
$P = \{\emptyset, <, \leq, =, >, \geq, \neq, ?\}$ contains every possible relation between events.
The relation $? = \{<, =, >\}$ means no information is known about that relation.

PA is more computationally attractive than IA in that for PA path-consistency
alone ensures satisfiability [8]. It is also possible to encode some relations from
IA into PA [8]. However, a major disadvantage of PA is that expressive power
is lost by representing time as points.
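The path-consistency update described earlier can be illustrated on Point Algebra, where the composition table is small enough to write out directly. The following is a minimal sketch rather than the implementation used in this paper: relations are encoded as frozensets over the atoms `<`, `=`, `>`, and the helper names (`compose`, `make_network`, `path_consistency`) are assumptions introduced for the example.

```python
from itertools import product

ALL = frozenset("<=>")                      # the no-information relation ?
INVERSE = {"<": ">", "=": "=", ">": "<"}


def compose_atoms(p, q):
    """Composition of two atomic point relations."""
    if p == "=":
        return {q}
    if q == "=" or p == q:
        return {p}
    return set(ALL)                         # '<' with '>' (or '>' with '<')


def compose(r1, r2):
    """Composition of two (possibly disjunctive) point relations."""
    out = set()
    for p, q in product(r1, r2):
        out |= compose_atoms(p, q)
    return frozenset(out)


def inverse(r):
    return frozenset(INVERSE[p] for p in r)


def make_network(n, constraints):
    """Build an n x n matrix; unspecified entries are ALL, the diagonal is '='."""
    m = [[frozenset("=") if i == j else ALL for j in range(n)] for i in range(n)]
    for (i, j), rel in constraints.items():
        m[i][j] = frozenset(rel)
        m[j][i] = inverse(m[i][j])
    return m


def path_consistency(m):
    """Tighten m in place; return False as soon as an entry becomes empty."""
    n = len(m)
    changed = True
    while changed:
        changed = False
        for a, b, c in product(range(n), repeat=3):
            tightened = m[a][c] & compose(m[a][b], m[b][c])
            if not tightened:
                return False
            if tightened != m[a][c]:
                m[a][c], m[c][a] = tightened, inverse(tightened)
                changed = True
    return True


chain = make_network(3, {(0, 1): "<", (1, 2): "<"})
cycle = make_network(3, {(0, 1): "<", (1, 2): "<", (0, 2): ">"})
chain_ok = path_consistency(chain)          # also infers that point 0 precedes point 2
cycle_ok = path_consistency(cycle)          # the cyclic ordering cannot be satisfied
```

For IA the same loop applies, but `compose_atoms` would have to be replaced by Allen's 13 x 13 composition table, and, as noted above, a successful run would no longer guarantee satisfiability.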

While computing satisfiability of the full set of interval relations $A$ is
NP-Hard, there exist subsets of $A$ that require only polynomial time. The $SA_c$ subset
defined by Van Beek and Cohen [7] contains all the relations from $A$ that can be
converted to PA. Another popular subset is the ORD-Horn maximal subset $H$
which contains all the relations from $A$ that provide satisfiability for the
path-consistency method [4]. The ORD-Horn subset also includes all the relations in
$SA_c$ such that $SA_c \subset H$.

3 Partial Constraint Satisfaction 

Partial Constraint Satisfaction (PCS) [3] is the process of finding values for a 
subset of the variables in a problem that satisfies a subset of the constraints. A 
partial solution is desirable in several cases: 

• The problem is overconstrained and as such has no solution. 

• The problem is computationally too large to find a solution in a reasonable 
amount of time. 

• The problem has to be solved within fixed resource bounds. 

• The problem is being solved in a real-time environment where it is necessary
to report the current best solution found at any time.

There are several methods that can be used to obtain a partial solution [3]:

1. Remove variables from the problem. 

2. Remove constraints from the problem. 

3. Weaken constraints in a problem. 

4. Widen the domains of variables to include extra values. 

Removing a variable from the problem is a very drastic approach to obtain a 
partial solution. By removing a variable, all the constraints associated with that 
variable are also removed. Conversely if, when removing constraints, a variable 
is left with no constraints, then this effectively removes that variable. Weakening 
a constraint to the point where that constraint no longer constrains the variable 
effectively removes that constraint from the problem. From this we can see that 
methods 1 and 2 are really special instances of method 3. The fourth method 
however has no relation to the other methods. If a variable’s domain is widened 
to the extent that it includes all possible values, the constraints on that variable 
can still make it impossible to assign a value to that variable, and hence the 
variable is not removed from the problem.

Whichever method is chosen to find a partial solution, there is still the
question of what constitutes an optimal partial solution. The simplest method
is to count the number of variables/constraints removed or the number of do- 
mains/constraints weakened. The solution that provides the minimal count is 
then considered optimal (a solution with a count of 0 equates to a fully con- 
sistent solution) and the solution count can be used to represent the solution 
cost. 



4 Temporal Constraint Partial Satisfaction 

While current temporal reasoning algorithms are relatively fast and efficient 
they are unable to provide partial solutions to overconstrained problems. Many 
applications, such as scheduling, require a solution to the problem presented even 
when the problem is overconstrained. Applying the current temporal reasoning 
algorithms will only identify that the problem is indeed overconstrained and as 
such has no solution. To address this shortcoming we introduce two algorithms
for finding partial solutions.



4.1 Method 1 

The first method uses a standard branch and bound search with forward check- 
ing/pruning to gain an optimal partial solution. The algorithm starts by initial- 
izing a dummy network such that all relations in the network are the relation 
I. This dummy network is then passed to the branch and bound algorithm and 
the search begins. 

First a relation is chosen, which is then divided into two sets: a consistent set 
CS and an inconsistent set IS. The set CS contains only relations that appear 
in both the original relation and in what remains in the dummy network’s rela- 
tion. For example, if the original had the relation {<, m, mi, s} and the dummy 
relation {<, mi, f, fi, >} then the set CS would be {<, mi}. The set IS contains 
those relations not in CS, which in our example would be {f, fi, >}. After this, 
a single relation is chosen first from the set CS and instantiated in the dummy 
network. The Path Consistency algorithm is then called to propagate the effects 
of this instantiation. In the event that the branch and bound algorithm back- 
tracks to this point or the Path Consistency call fails, another atomic relation 
is chosen. If all relations from the set CS have been tried then atomic relations 
are chosen from the set IS. However when a relation from the set IS is chosen, 
a cost count is incremented to reflect that a relation was chosen in conflict with 
the originally desired relations. 
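The split into CS and IS is plain set arithmetic. As a small illustration, the example from the text can be reproduced with Python sets; the function name `split_relation` is an assumption made for this sketch.

```python
def split_relation(dummy_relation, original_relation):
    """Split the atoms still allowed in the dummy network into CS and IS."""
    consistent = dummy_relation & original_relation      # CS: also wanted originally
    inconsistent = dummy_relation - consistent           # IS: allowed but not wanted
    return consistent, inconsistent


# The relations used in the example above.
original = {"<", "m", "mi", "s"}
dummy = {"<", "mi", "f", "fi", ">"}
cs, is_ = split_relation(dummy, original)
```

Atoms taken from `cs` leave the cost unchanged, while each atom taken from `is_` adds 1 to the cost of the current search path.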

If the Path Consistency call was successful then another relation is chosen
and the process begins again. At any time, if the cost of the current path exceeds
the current best cost then backtracking occurs to a point where the cost is
lower than the best cost and processing is started again. When all relations are
exhausted the best result is returned as the optimal solution.

Input: Original: The original network
       Dummy: A dummy network
       Cost

Method1
Begin
  If Cost >= BestCost then backtrack
  If PathConsistent(Dummy) fails then backtrack
  If there are still relations to process in Dummy then
  begin
    get next relation (X, Y) from Dummy
    CS = Dummy[X, Y] ∩ Original[X, Y]
    IS = Dummy[X, Y] - CS
    for all i in CS do
    begin
      instantiate Dummy[X, Y] to i
      Method1(Original, Dummy, Cost)
    end
    for all i in IS do
    begin
      instantiate Dummy[X, Y] to i
      Method1(Original, Dummy, Cost + 1)
    end
  end
  else
  begin
    Record Dummy as the best solution found so far
    BestCost = Cost
  end
End
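To make the control flow of Method1 concrete, here is a runnable sketch. It is deliberately simplified and should not be read as the authors' implementation: it works on Point Algebra rather than IA so that the composition table fits in a few lines, it copies the network at each step instead of undoing instantiations, and all names (`method1`, `solve`, `make_network`) are assumptions made for the example.

```python
from itertools import product

ALL = frozenset("<=>")
INVERSE = {"<": ">", "=": "=", ">": "<"}


def compose(r1, r2):
    """Composition of two point relations given as sets of atoms."""
    out = set()
    for p, q in product(r1, r2):
        out |= {q} if p == "=" else {p} if q in ("=", p) else set(ALL)
    return frozenset(out)


def inverse(r):
    return frozenset(INVERSE[p] for p in r)


def path_consistent(m):
    """Return a tightened copy of m, or None if some entry becomes empty."""
    m = [row[:] for row in m]
    changed = True
    while changed:
        changed = False
        for a, b, c in product(range(len(m)), repeat=3):
            t = m[a][c] & compose(m[a][b], m[b][c])
            if not t:
                return None
            if t != m[a][c]:
                m[a][c], m[c][a] = t, inverse(t)
                changed = True
    return m


def method1(original, dummy, edges, k, cost, best):
    if cost >= best["cost"]:                 # bound: cannot improve on the best
        return
    tight = path_consistent(dummy)           # forward checking / pruning
    if tight is None:
        return
    if k == len(edges):                      # every relation is instantiated
        best["cost"], best["solution"] = cost, tight
        return
    x, y = edges[k]
    cs = tight[x][y] & original[x][y]
    is_ = tight[x][y] - cs
    for atoms, penalty in ((cs, 0), (is_, 1)):
        for atom in sorted(atoms):
            child = [row[:] for row in tight]
            child[x][y] = frozenset(atom)
            child[y][x] = inverse(child[x][y])
            method1(original, child, edges, k + 1, cost + penalty, best)


def make_network(n, constraints):
    m = [[frozenset("=") if i == j else ALL for j in range(n)] for i in range(n)]
    for (i, j), rel in constraints.items():
        m[i][j], m[j][i] = frozenset(rel), inverse(frozenset(rel))
    return m


def solve(original):
    n = len(original)
    dummy = make_network(n, {})              # every relation starts as "no information"
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)]
    best = {"cost": float("inf"), "solution": None}
    method1(original, dummy, edges, 0, 0, best)
    return best["cost"], best["solution"]


# Overconstrained example: 0 < 1, 1 < 2 and 2 < 0 cannot all hold.
cycle_cost, cycle_solution = solve(make_network(3, {(0, 1): "<", (1, 2): "<", (0, 2): ">"}))
```

In this example one of the three constraints has to be violated, so the search reports a cost of 1 together with a consistent atomic network.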



4.2 Method 2 

The second method, as before, uses a branch and bound algorithm to control 
the search. However, unlike the first method, no forward checking/pruning is 
performed as the actual cost is only computed at the end of a search path. 
Instead at each step in the search an approximate cost is found based on how 
many relations need to potentially be changed to make the network consistent. 
With this approximate value a decision is made as to whether to proceed on 
this path or abandon it. This requires two additional calculations, a Real Cost 
function and an Approximate Cost function: 



Approximate Cost Function. At each level of the search it is necessary to 
judge the cost of the partially explored solution. The ApproximateCost function 
finds an approximate cost that is always equal to or less than the real cost of the 
partially explored solution. The reason for using an approximate cost function 
(instead of finding the real cost) is that until all relations are atomic it is very 
costly to find an absolute cost (as finding every inconsistency at this point would 
require a separate NP-Hard search and an additional search to find the best cost). 

To calculate the approximate cost we first determine a lower bound of the
number of inconsistent triples. A triple is a set of any three nodes from the
problem. A triple $T = (A, B, C)$ is inconsistent if $M_{AC} \cap (M_{AB} \circ M_{BC}) = \emptyset$. Testing
path $(A, B, C)$ is enough to determine an inconsistency. Computing $(B, C, A)$ and
$(B, A, C)$ is unnecessary due to the fact that if the path $(A, B, C)$ is consistent
then there is an atomic relation $X$ in $M_{AB}$ and $Y$ in $M_{BC}$ that make some or all
atomic relations in $M_{AC}$ consistent. Now if we take the composition of $M_{AC}$ and
the inverse of $Y$, the resulting allowed relations will include $X$. This is because
given any three atomic relations $N$, $P$, $Q$ that are path consistent then
$Q \in (N \circ P)$, $N \in (Q \circ P^{\smile})$ and $P \in (N^{\smile} \circ Q)$.

When a triple is determined as inconsistent it is added to a list and a count
for each relation in the triple is incremented. The counts for all relations are
stored in an occurrence matrix $O$, with each element of $O$ starting at 0. For
example, if the triple $(A, B, C)$ is inconsistent then $O_{AB}$, $O_{BC}$ and $O_{AC}$ are all
incremented by 1 to represent that each relation occurred in an inconsistency.



Input: Network M
Output: A List containing all inconsistent triples
        A matrix O recording the occurrence count for each relation

DetermineInconsistencies
Begin
  For A = 1 to size(M) - 2 do
    For B = A + 1 to size(M) - 1 do
      For C = B + 1 to size(M) do
      begin
        If M[A,C] ∩ (M[A,B] ∘ M[B,C]) = ∅ then
        begin
          add (A,B,C) to List
          increment O[A,B], O[B,C] and O[A,C] by 1
        end
      end
  return (List, O)
End
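The triple scan can be sketched in a few lines of Python. As with the other examples, this is an illustration under stated assumptions: it uses Point Algebra relations (sets over `<`, `=`, `>`) so that composition is easy to define, nodes are numbered from 0, and only the upper triangle of the matrix is read.

```python
from itertools import combinations, product

ALL = frozenset("<=>")


def compose(r1, r2):
    """Composition of two point relations given as sets of atoms."""
    out = set()
    for p, q in product(r1, r2):
        out |= {q} if p == "=" else {p} if q in ("=", p) else set(ALL)
    return frozenset(out)


def determine_inconsistencies(m):
    """Return the inconsistent triples (A < B < C) and the occurrence matrix O."""
    n = len(m)
    triples = []
    occurrences = [[0] * n for _ in range(n)]
    for a, b, c in combinations(range(n), 3):
        if not (m[a][c] & compose(m[a][b], m[b][c])):
            triples.append((a, b, c))
            for i, j in ((a, b), (b, c), (a, c)):
                occurrences[i][j] += 1
    return triples, occurrences


# Four points: 0 < 1, 1 < 2 and 0 > 2 clash; point 3 is unconstrained.
network = [[ALL] * 4 for _ in range(4)]
network[0][1] = frozenset("<")
network[1][2] = frozenset("<")
network[0][2] = frozenset(">")
triples, occurrences = determine_inconsistencies(network)
```

Here only the triple over points 0, 1 and 2 is reported, and each of its three relations receives an occurrence count of 1.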



Once the list of inconsistencies is determined it is processed to find an approx- 
imate number of relations to weaken to remove all inconsistencies. In simplified 
terms the algorithm takes a triple from the inconsistency list and tries each re- 
lation one at a time, effectively performing a brute force search. However there 
are some special circumstances which allow the algorithm to be more efficient: 

The first situation occurs when every relation in a triple occurs only once. 
Here it does not matter which relation is chosen, as no other triple will be re- 
moved from the list, hence the cost is incremented by 1 and processing continues. 
Lines 9-13 of the following code handle this case. 

The second situation is when a triple is chosen that contains a relation that 
has already been selected. In this case the occurrence matrix is reduced by 1 
for each relation in the triple and the cost remains the same. Lines 14-20 of the 
following code handle this situation. 

The last case is when a triple contains certain relations that only occur once. 
These relations are ignored as choosing them will not affect other triples and 
therefore will provide no possibility of offering a lower approximate cost. Line 
23 is used to check for and handle this case. 



Input: List of inconsistencies
       An occurrence matrix O
       Cost
       BestCost

ApproxCost
 1. Begin
 2.   If Cost >= BestCost then backtrack
 3.   If there are no more triples left in List then
 4.   begin
 5.     BestCost = Cost
 6.     backtrack
 7.   end
 8.   get and remove the next triple (A,B,C) from List
 9.   If O[A,B], O[A,C] and O[B,C] all equal 1 then
10.   begin
11.     ApproxCost(List, O, Cost + 1, BestCost)
12.     backtrack
13.   end
14.   If O[A,B] or O[A,C] or O[B,C] <= 0 then
15.   begin
16.     decrement all three relations in the occurrence matrix O by 1
17.     ApproxCost(List, O, Cost, BestCost)
18.     increment all three relations in the occurrence matrix O by 1
19.     backtrack
20.   end
21.   For each relation R in {AB, AC, BC} do
22.   begin
23.     If O[R] > 1 then
24.     begin
25.       TVal = O[R]
26.       decrement O[A,B], O[A,C] and O[B,C] by 1
27.       set O[R] to 0
28.       ApproxCost(List, O, Cost + 1, BestCost)
29.       increment O[A,B], O[A,C] and O[B,C] by 1
30.       O[R] = TVal
31.     end
32.   end
33. End



The ApproximateCost function is responsible for calling DetermineInconsistencies
and then passing its results to ApproxCost.



Input: Network M
       CurrentBestCost the current BestCost value
Output: Cost

ApproximateCost
Begin
  (List, O) = DetermineInconsistencies(M)
  BestCost = CurrentBestCost
  ApproxCost(List, O, 0, BestCost)
  return BestCost
End



Real Cost Function. At the end of a search, when all relations are atomic, 
the real cost of that solution can be determined. Unlike the ApproximateCost 
algorithm, RealCost returns not only a cost but also a consistent network. The 
question arises however of why it is not possible to use the ApproximateCost 
algorithm to determine the real cost when all relations are atomic? When pre- 
sented with the network in Figure 1 it is possible for ApproximateCost to work 
out a minimal cost that does not provide a consistent network. In this network 
the cost of solving is 2, however if the relations chosen are R(A,D) and R(B,C) 
then this still does not provide a solution as no value can be assigned to those 
relations together to make them consistent. 




Fig. 1. 



To handle this problem it is necessary to perform a full PathConsistency
check at the end of a search. Furthermore it is also necessary to include relations
with an occurrence of 1 in the search, which impacts the performance
significantly. Another problem that can arise is illustrated by the network in Figure 1
and occurs when relations R(A,D) and R(B,C) are both the relation I. In this
case the real cost algorithm will never find a solution since no inconsistencies are
reported by the DetermineInconsistencies algorithm. This problem is handled by
allowing the search path to extend into these relations and thus allowing real
cost to function properly. Unfortunately this also results in a large increase in
the search space.

Finding the real cost of a network is similar to finding the approximate cost 
in that we process a list of inconsistencies to find the least number of relations 
to change to remove all inconsistencies. However we can no longer make use of 
all the special circumstances used in the approximate cost algorithm and some 
extra processing is also required to verify that the solution found is consistent. 
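
The core of this step, choosing the fewest relations so that every inconsistent 
triple contains at least one chosen relation, can be illustrated with a minimal 
Python sketch. It is a simplification under stated assumptions: it ignores the 
occurrence matrix, the special circumstances and the final consistency check, 
and the names `relations_of` and `min_relations_to_weaken` are hypothetical 
rather than taken from the algorithms in this paper. 

```python
from itertools import combinations


def relations_of(triple):
    """Return the three relations of a triple as sorted variable pairs."""
    return [tuple(sorted(pair)) for pair in combinations(triple, 2)]


def min_relations_to_weaken(triples, best_cost=float("inf")):
    """Branch and bound over the list of inconsistent triples.

    Returns the smallest number of relations such that every triple
    contains at least one of them, or best_cost if no smaller value exists.
    """
    best = best_cost

    def search(remaining, chosen):
        nonlocal best
        if len(chosen) >= best:  # bound: this branch cannot improve on best
            return
        # Triples that already contain a chosen relation count as solved.
        while remaining and any(r in chosen for r in relations_of(remaining[0])):
            remaining = remaining[1:]
        if not remaining:
            best = len(chosen)
            return
        for rel in relations_of(remaining[0]):  # branch on each relation
            search(remaining[1:], chosen | {rel})

    search(list(triples), frozenset())
    return best


# Two inconsistent triples sharing the relation (A, B): one change suffices.
print(min_relations_to_weaken([("A", "B", "C"), ("A", "B", "D")]))
```

Passing the current best cost as `best_cost` mirrors the way BestCost prunes 
the search in the pseudocode, although the real algorithms bound on more 
information than this sketch does. 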

The only special circumstance that can be kept is when one of the relations 
in a triple has an occurrence of 0 or less. As before these triples are ignored 
as they are already solved. However we must record any relation in that triple 
that has an occurrence greater than 0 in the Removed list. This is due to the 
possibility that we may remove a relation from consideration that needs to be 
weakened to gain the optimal cost. Lines 10-18 of the following code handle this 
circumstance. 

All other triples are processed normally and relations that have an occurrence 
of 1 are treated the same as other relations. Since the solution found is required 
to be consistent it is possible that selecting one relation over another, where both 
have an occurrence of 0, could result in the final solution still being inconsistent. 
All relations that are considered here are marked as occurring in the search path. 
Lines 19-32 of the following code process this situation. 

When there are no more triples left, the marked relations in the Removed 
list are then deleted from the list. The Removed list now only contains those 
relations that have no chance of being selected at an earlier stage. The Removed 
list is then passed to the ProcessRemoved algorithm which finds the final cost 
and solution. Lines 3-8 of the following code handle this situation. 



Input: Network M 

List of inconsistencies 
An occurrence matrix O 

NewNet a place to store the best solution 
Cost 

BestCost 

Removed a list of relations 

RCost 

1. Begin 

2. If Cost >= BestCost then backtrack 

3. If there are no more triples left in List then 

4. begin 

5. remove all the relations from Removed that have been marked 

6. ProcessRemoved(M, NewNet, Removed, Cost, BestCost) 

7. backtrack 

8. end 

9. get and remove the next triple (A,B,C) from List 

10. If O_AB or O_AC or O_BC <= 0 then 

11. begin 

12. decrement all three relations in the occurrence matrix O by 1 

13. add the relations that have an occurrence > 0 to Removed 

14. RCost(M, List, O, NewNet, Cost, BestCost, Removed) 

15. remove the relations added to Removed 

16. increment all three relations in the occurrence matrix O by 1 

17. backtrack 

18. end 

19. For every relation R in {AB, AC, BC} do 

20. begin 

21. TRel = M_R 

22. TVal = O_R 

23. decrement O_AB, O_AC, O_BC by 1 

24. mark all three relations (AB, AC, BC) 

25. set O_R to 0 

26. set M_R to the relation I 

27. RCost(M, List, O, NewNet, Cost + 1, BestCost, Removed) 

28. M_R = TRel 

29. increment O_AB, O_AC, O_BC by 1 

30. O_R = TVal 

31. unmark all three relations 

32. end 

33. End 



Relations are marked incrementally, not simply with a boolean value. Marking 
a relation indicates that it has no possibility of being excluded from a search. 

When there are no more triples in List, the function ProcessRemoved is 
called to handle a rare situation which could otherwise result in the best cost 
not being found. The problem occurs when a triple is removed where one of the 
relations has an occurrence of 0 or less. This makes it possible for a relation 
that should be weakened to gain the best cost to be excluded from a search. 
The ProcessRemoved algorithm initially checks whether the current solution is 
consistent; if it is, the relations in the Removed list are not processed. If 
the solution is not consistent, then one or more of the relations in the 
Removed list need to be weakened to allow a solution. Line 6 checks consistency 
by calling PathConsistent, which checks that the supplied network is 
path-consistent. 



Input: Network M 

NewNet a place to store the best solution 

Removed a list of removed triples 

Cost 

BestCost 

ProcessRemoved 

1. Begin 

2. If Cost >= BestCost then backtrack 

3. If Removed is empty then 

4. begin 

5. TemporaryNetwork = M 

6. If PathConsistent(TemporaryNetwork) does not fail then 

7. begin 

8. BestCost = Cost 

9. NewNet = TemporaryNetwork 

10. end 

11. backtrack 

12. end 

13. get and remove the next relation R from Removed 

14. ProcessRemoved(M, NewNet, Removed, Cost, BestCost) 

15. If Cost < BestCost - 1 then 

16. begin 

17. set M_R to the relation I 

18. ProcessRemoved(M, NewNet, Removed, Cost + 1, BestCost) 

19. restore M_R to its previous relation 

20. end 

21. Add relation R back to Removed 

22. End 
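
The include-or-skip recursion of ProcessRemoved can be sketched in Python as 
follows. This is an illustration under assumptions, not the implementation used 
in the experiments: the network is taken to be a dictionary from relation names 
to values, `is_path_consistent` is a caller-supplied stand-in for 
PathConsistent that does not modify its argument, and `best` is a small 
dictionary holding the best cost and network found so far. 

```python
def process_removed(network, removed, cost, best, is_path_consistent,
                    universal="I"):
    """Try weakening subsets of `removed` to the universal relation.

    `best` is a dict with keys "cost" and "net"; it is updated in place
    whenever a cheaper consistent network is found.
    """
    if cost >= best["cost"]:  # bound: no improvement possible
        return
    if not removed:
        if is_path_consistent(network):
            best["cost"] = cost
            best["net"] = dict(network)  # keep a copy of the solution
        return
    rel, rest = removed[0], removed[1:]
    # Branch 1: leave `rel` unchanged.
    process_removed(network, rest, cost, best, is_path_consistent, universal)
    # Branch 2: weaken `rel`, but only if that can still beat the best cost.
    if cost < best["cost"] - 1:
        previous = network[rel]
        network[rel] = universal
        process_removed(network, rest, cost + 1, best, is_path_consistent,
                        universal)
        network[rel] = previous  # restore on backtrack


# Toy example: the network counts as consistent only once (A, B) is weakened.
net = {("A", "B"): "<", ("B", "C"): "<"}
best = {"cost": float("inf"), "net": None}
process_removed(net, [("A", "B"), ("B", "C")], 0, best,
                lambda n: n[("A", "B")] == "I")
print(best["cost"], best["net"])
```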



RealCost is similar to ApproximateCost in that it is really an interface to 
the functions that perform the main work. 



Input: Network M 

CurrentBestCost 
Output: BestCost 

NewNet a consistent network 

RealCost 

1. Begin 

2. (List, O) = DetermineInconsistencies(M) 

3. BestCost = CurrentBestCost 

4. set list Removed to empty 

5. RCost(M, List, O, NewNet, 0, BestCost, Removed) 

6. return (BestCost, NewNet) 

7. End 

5 Experimental Results 



In this section we present the preliminary results we have obtained by 
implementing the algorithms discussed and testing them with generated problems. 
The test problems were generated using Nebel’s temporal reasoning problem 
generator [4]. The experiments were conducted on a Pentium 3 733 MHz processor 
with 256 megabytes of RAM running the Linux operating system. A label size 
(average number of atomic relations per relation) of 3 and 100 test cases 
were used for all experiments. Each graph uses a different degree, the degree of 
a problem being a fraction indicating how many relations are known. For 
example, a degree value of 1 indicates that all relations in the problem are 
known, whereas a degree of .25 indicates that only 25% of relations are known. 
For each graph two types of problems were generated: a consistent problem, 
which has a consistent solution, and a random problem, which may or may not 
have a consistent solution. The Y axis for each graph represents the average 
run-time for a set of problems and uses a logarithmic scale. The X axis (k) 
shows the number of events used in a problem. 

The results of the four graphs show a trend where Method1 generally performs 
better at lower degrees and Method2 performs better at higher degrees. 
This is to be expected, as lower degree problems contain a greater proportion of 
unknown relations and Method1 does not need to explicitly explore unknown 
relations (unlike Method2). Also, at lower degrees there is a higher probability 
that the generated problem will be consistent (both algorithms appear to perform 
better on consistent problems). More significantly, however, Method1 
increasingly dominates Method2 as k increases, regardless of the problem degree. 

Overall, the preliminary results indicate that Method1 is the better algorithm 
due to its predictable nature and better scaling. Whilst Method2 often 
outperforms Method1 on particular problems, it is evident that as k gets larger 
Method1 will begin to dominate Method2. Analysing the raw data shows that 
in some cases Method2 takes an unusually long time to find a solution, 
significantly altering the average result. This may be occurring because 
Method2 is performing a detailed search in an unpromising part of the search 
tree. Such behaviour could be modified by using a random restart strategy, an 
option we are currently investigating. 

Fig. 5. Graph D: Degree = .25 

Whilst it appears that Method1 is the superior algorithm for finding a partial 
solution in the backtracking domain, it cannot be easily adapted to other 
forms of search algorithm. Method2, however, was specifically designed so that 
key parts (the approximate and real cost algorithms) could be utilized later in 
other search strategies, for instance local search algorithms. Method2 works on 
the premise of taking an inconsistent solution (where all relations are atomic) 
and then repairing that inconsistent solution to gain a partial solution 
(obtained by the real cost algorithm). Since it is rare to start with a problem 
where all relations are atomic, we have to perform a search, guided by the 
approximate cost algorithm, to reach this situation. If on a rare occasion we 
did start with such a problem, Method2 would by far outperform Method1. 

For the backtracking domain, as k gets larger Method2 has to search an 
increasingly large search space without the aid of propagation techniques to 
reduce it. This is most likely the reason why Method1 starts to perform better 
for higher k values, and it makes Method1 the better choice when a 
backtracking search must be used to gain an optimal partial solution. 

6 Conclusion and Future Work 

Finding a partial solution to a Temporal Reasoning problem has not been well 
investigated to date. In this paper we have outlined two algorithms that can be 
used in finding a solution to a TPCS problem. Both algorithms are guaranteed 
to find the optimal partial solution (optimal being the minimum number of 
relations violated). 

The preliminary experimental results show that using a traditional branch and 
bound type algorithm is only practical on small sized problems and so is not 
expected to be useful in more realistic situations. The results also show that 
Method1, while sometimes being slower than Method2, is more consistent in 
finding solutions and is probably the superior algorithm due to its better 
scaling performance. 

For future work we will be extending the experimental results and inves- 
tigating ways to improve the performance of both algorithms. One idea is to 
employ ordering heuristics. These should improve the performance of both al- 
gorithms and particularly address the lack of consistency in Method2. We will 
also be investigating local search techniques to gain partial solutions. Whilst 
local search does not guarantee an optimal solution, experience in other CSP 
domains indicates it may be more effective on larger problems. 

Abstract. We investigate how the performance of search for solving finite 
constraint satisfaction problems (CSPs) is affected by the level of 
interchangeability embedded in the problem. First, we describe a generator 
of random CSPs that allows us to control the level of interchangeability 
in an instance. Then we study how the varying level of interchangeability 
affects the performance of search for finding one solution and all solutions 
to the CSP. We conduct experiments using forward-checking search, extended 
with static and dynamic ordering heuristics in combination with 
non-bundling, static, and dynamic bundling strategies. We demonstrate 
that: (1) While the performance of bundling decreases in general with 
decreasing interchangeability, this effect is muted when finding a first 
solution. (2) Dynamic ordering strategies are significantly more resistant 
to this degradation than static ordering. (3) Dynamic bundling strategies 
perform overall significantly better than static bundling strategies. 
Even when finding one solution, the size of the bundles yielded by dynamic 
bundling is large and less sensitive to the level of interchangeability. 
(4) The combination of dynamic ordering heuristics with dynamic 
bundling is advantageous. We conclude that this combination, in addition 
to yielding the best results, is the least sensitive to the level of 
interchangeability, and thus is indeed superior to other searches. 



1 Introduction 

A Constraint Satisfaction Problem (CSP) [12] is the problem of assigning values 
to a set of variables while satisfying a set of constraints that restrict the 
allowed combinations of values for variables. In its general form, a CSP is 
NP-complete, and backtrack search remains the ultimate technique for solving it. 
Because of the flexibility and expressiveness of the model, Constraint 
Satisfaction has emerged as a central paradigm for modeling and solving various 
real-world decision problems in computer science, engineering, and management. 

It is widely acknowledged that real-world problems exhibit an intrinsic non- 
random structure that makes most instances ‘easy’ to solve. When the structure 
of a particular problem is known in advance, it can readily be embedded in the 
model and exploited during search [3], as is commonly done for the pigeon-hole 
problem. A challenging task is to discover the structure in a particular problem 
instance. In our most recent research [5,2,4], we have investigated mechanisms 
for discovering and exploiting one particular type of symmetry structure, called 
interchangeability, that allows us to bundle solutions, and have integrated these 
bundling mechanisms with backtrack search. We have investigated and eval- 
uated the effectiveness of this integration and demonstrated its utility under 
particularly adverse conditions (i.e., random problems generated without any 
particular structure embedded a priori and puzzles known to be extraordinarily 
resistant to our symmetry detection techniques). In this paper, we investigate 
how the performance of these new search strategies is affected by the level of 
interchangeability embedded in the problem. We first show how to generate ran- 
dom problems with a controlled level of inherent structure, then demonstrate 
the effects of this structure on the performance of the various search mecha- 
nisms with and without interchangeability detection. 

Section 2 gives a brief background to the subject and summarizes our previous 
work. Section 3 describes our random generator, designed to create random CSP 
instances with a pre-determined level of interchangeability. Section 4 introduces 
the problem sets used for testing, and demonstrates the performance of our 
search strategies across varying levels of interchangeability. Section 5 concludes 
the paper with directions for future research. 



2 Background and Contributions 

A finite Constraint Satisfaction Problem (CSP) is defined as 
$\mathcal{P} = (\mathcal{V}, \mathcal{D}, \mathcal{C})$, where 
$\mathcal{V} = \{V_1, V_2, \ldots, V_n\}$ is a set of variables, 
$\mathcal{D} = \{D_{V_1}, D_{V_2}, \ldots, D_{V_n}\}$ is the set 
of their corresponding domains (the domain of a variable is a set of possible 
values), and $\mathcal{C}$ is a set of constraints that specifies the acceptable 
combinations of values for variables. A solution to the CSP is the assignment of 
a value to 
each variable such that all constraints are satisfied. The question is to find one 
or all solutions. A CSP is often represented as a constraint (hyper-)graph in 
which the variables are represented by nodes, the domains by node labels, and 
the constraints between variables by (hyper-)edges linking the nodes in the scope 
of the corresponding constraint. We study CSPs with finite domains and binary 
constraints (i.e., constraints apply to two variables). 
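
As an illustration of this definition, a binary CSP can be written down 
directly as a set of domains plus, for each constrained pair of variables, the 
set of allowed value pairs. The following minimal Python sketch enumerates all 
solutions by brute force; the representation and the names `is_solution` and 
`all_solutions` are assumptions made for the example and are unrelated to the 
search procedures studied in this paper. 

```python
from itertools import product


def is_solution(assignment, constraints):
    """True if every binary constraint allows the assigned pair of values."""
    return all((assignment[x], assignment[y]) in allowed
               for (x, y), allowed in constraints.items())


def all_solutions(domains, constraints):
    """Enumerate every complete assignment and keep the consistent ones."""
    names = list(domains)
    for values in product(*(domains[v] for v in names)):
        assignment = dict(zip(names, values))
        if is_solution(assignment, constraints):
            yield assignment


# Three variables over {1, 2}; neighbouring variables must differ.
domains = {"V1": [1, 2], "V2": [1, 2], "V3": [1, 2]}
differ = {(1, 2), (2, 1)}
constraints = {("V1", "V2"): differ, ("V2", "V3"): differ}
for solution in all_solutions(domains, constraints):
    print(solution)
```

Exhaustive enumeration is exponential in the number of variables, which is why 
backtrack search with pruning is used in practice. 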

Since a general CSP is NP-complete, it is usually solved by backtrack search, 
which is an exponential procedure. We enhance this basic backtrack search 
through the identification and exploitation of structure in the problem instance. 
This structure is in the form of symmetries. In particular, we make use of a type 
of symmetry called interchangeability, which was introduced and categorized by 
Freuder in [7]. We limit our investigations to interchangeability among the val- 
ues in the domain of one given variable. Interchangeability between two values 
for the variable exists if the values can be substituted for one another without 
affecting the assignments of the remaining variables. Two such values are said 
to belong to the same equivalence class. Each equivalence class is a bundle of 
values that can be replaced by one representative of the bundle, thus reducing 
the size of the initial problem. We call the number of distinct equivalence classes 
in the domain of a variable the degree of induced domain fragmentation, IDF. 

Freuder [7] proposed an efficient algorithm, based on building a discrimination 
tree, for computing one type of interchangeability, neighborhood 
interchangeability (NI). NI partitions the domain of a variable into equivalence 
classes given all the constraints incident to that variable. Haselböck [11] 
simplified NI to a weaker form that we call neighborhood interchangeability 
according to one constraint (NIc). He showed how to exploit NIc advantageously 
in backtrack search (BT), with and without forward-checking (FC), for finding 
all the solutions of a CSP. He also showed how NIc groups multiple solutions of 
a CSP into solution bundles. In a solution bundle, each variable is assigned a 
specific subset of its domain instead of the unique value usually assigned by 
backtrack search. Any combination of one value per variable in the solution 
bundle is a solution to the CSP. Such a bundle not only yields a compact 
representation of this solution set, but is also useful in the event that one 
component of a solution fails, and an alternate, equivalent solution must be 
found quickly. In the bundling strategy proposed by Haselböck, symmetry 
relations are discovered before search is started. These are static 
interchangeability relations. We refer to this strategy as static bundling. 
Below we summarize our previous results [5,2,4], which motivate the 
investigations we report here. 
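
To make the notion of a solution bundle concrete, the following minimal Python 
sketch expands a bundle into the individual solutions it represents. The 
dictionary representation and the function names are assumptions made for 
illustration only. 

```python
from itertools import product
from math import prod


def bundle_size(bundle):
    """Number of solutions represented by a bundle."""
    return prod(len(values) for values in bundle.values())


def expand_bundle(bundle):
    """List every solution obtained by picking one value per variable."""
    names = list(bundle)
    choices = (sorted(bundle[name]) for name in names)
    return [dict(zip(names, values)) for values in product(*choices)]


# A bundle over three variables: 3 * 1 * 2 = 6 solutions stored compactly.
bundle = {"V1": {1, 3, 4}, "V2": {2}, "V3": {1, 5}}
print(bundle_size(bundle))
for solution in expand_bundle(bundle):
    print(solution)
```

The compactness is visible even in this toy case: six solutions are stored 
using six values in total, and the saving grows multiplicatively with the 
number of variables. 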

In [5], we proposed to compute interchangeability dynamically during search 
using a generalized form of Freuder’s discrimination tree, the joint 
discrimination tree of Choueiry and Noubir [6]. We called this type of 
interchangeability dynamic neighborhood partial interchangeability (DNPI). Since 
DNPI is computed during search, we say that it performs dynamic bundling. DNPI 
induces less domain fragmentation (larger partitions) than NIc and is thus 
likely to find larger solution bundles. We designed a new search strategy that 
combines dynamic bundling (DNPI) with forward-checking, and compared it to 
searches without bundling and with static bundling (NIc) for forward-checking 
search, see Fig. 1. We proved that the relations shown in Fig. 2 (left) hold 
when searching for all solutions (provided the variable and value orderings are 
the same for all searches), thus establishing that dynamic bundling is always 
worthwhile when solving for all solutions. In addition to the theoretical 
guarantees of Fig. 2 (left), we showed empirically that neither non-bundling 
(FC) nor static bundling (NIc) search outperforms dynamic bundling search in 
terms of the quality of bundling (i.e., number of solution bundles generated) 
and in terms of the standard comparison criteria for search (i.e., number of 
constraint checks and number of nodes visited). CPU time measurements were 
reasonably in line with the other criteria. 

    Search                          Comparison criteria 
    Non-bundling [10]: FC           Number of constraint checks CC, 
    Static bundling [11]: NIc       nodes visited NV, solution 
    Dynamic bundling [5]: DNPI      bundles SB, and CPU time. 

Fig. 1. Search and bundling strategies. 

In [2], we modified the forward-checking backtrack-search procedures of Fig. 1 
to allow the integration of dynamic variable-value orderings with bundling 
strategies, while looking for all solutions. We examined the following ordering 
heuristics: (1) static least-domain (SLD), (2) dynamic least-domain (DLD) and 
(3) dynamic variable-value ordering (promise of Geelen [9]). The search 
algorithms generated fell into the nine categories shown in Fig. 2 (right). 
Since the variable and value orderings can no longer be maintained across 
strategies, strong theoretical results similar to the ones of Fig. 2 (left) 
cannot be made. We instead make empirical evaluations. Our experiments on these 
nine search strategies showed that dynamic least-domain ordering combined with 
dynamic bundling (DNPI-DLD) almost always yields the most effective search and 
the most compact solution space. Further, we noted that although promise 
significantly reduces the number of nodes visited in the search tree, it is 
harmful in the context of searching for all solutions because the number of 
constraint checks it requires is prohibitively large.^1 

Fig. 2. Left: Comparison of search strategies assuming the same variable 
orderings for all strategies and while looking for all solutions. Right: 
Interleaving dynamic bundling with dynamic ordering. 

Finally, in [4], we addressed the task of finding a first solution. In addition 
to the ordering heuristics listed above (i.e., SLD, DLD, and promise), we 
proposed and tested two other ordering heuristics, specific to bundling: 
(1) Least-Domain-Max-Bundle (LD-MB) chooses the variable of smallest domain and, 
for this variable, the largest bundle in its domain; and (2) Max-Bundle chooses 
the largest available bundle among all bundles of all variables. We found that 
the promise heuristic of Geelen [9] performs particularly well for finding one 
solution, consistently finding the largest first bundle with loosest 
bottlenecks^2, and nearly always yielding a backtrack-free search. This must be 
contrasted with its bad performance for finding all solutions in [2]. Further, 
dynamic bundling again proved to outperform static bundling, especially when 
used in combination with promise. Finally, we noted that LD-MB, our proposed 
new heuristic, is competitive, with relatively few constraint checks, low CPU 
time, and good bundling. 

The research summarized above established the utility of discovering and 
exploiting interchangeability relationships in general. In all of our past work, 
algorithms were tested on CSPs created with the random generator of Bacchus 
and van Run [1], which did not intentionally embed any structure in the 
problems. This paper furthers our investigation of interchangeability and adds 
the following contributions: 



^1 The promise heuristic is by design best suited for finding one solution [9]. 

^2 The bottleneck of a solution bundle is the size of the smallest domain in the bundle. 

1. We introduce a generator of random CSPs that allows us to control the level 
of interchangeability embedded in a problem in addition to controlling the 
size of the CSP, and the density and tightness of the constraints. 

2. Using this generator, we conduct experiments that test the previously listed 
search strategies^3 across various levels of interchangeability. See Table 1. 



Table 1. Search strategies tested. 

    Problem                   Bundling             Ordering 
    Finding all solutions     {FC, NIc, DNPI}  x   {SLD, DLD} 
    Finding first solution    {FC, NIc, DNPI}  x   {SLD, DLD, LD-MB [4], promise [9]} 



3. We show that: (a) Both static and dynamic bundling search strategies do 
indeed detect and benefit from interchangeability embedded in a problem 
instance. (b) The performance of dynamic bundling is significantly superior to 
that of static bundling when looking for a first solution bundle. (c) Problems 
with embedded interchangeability are neither easier nor more difficult to solve 
for the naive FC algorithm. And (d) most algorithms are affected by the 
variance of interchangeability; however, DLD-ordered search is less sensitive 
and performs surprisingly well in all situations. 

3 A Generator That Controls Interchangeability 

Typically, a generator of random binary CSPs takes as input the parameters 
(n, a, p, t). The first two parameters, n and a, relate to the variables: n 
gives the number of variables, and a the domain size of each variable. The 
second two parameters, p and t, control the constraints: p gives the probability 
that a constraint exists between any two variables (which also determines the 
number of constraints in the problem, $C = p\,\frac{n(n-1)}{2}$), and t gives 
the constraint tightness (defined as the ratio of the number of tuples 
disallowed by the constraint over all possible tuples between the two 
variables). 
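
As a small worked example of how these parameters translate into concrete 
quantities, the sketch below computes the expected number of constraints from 
n and p, and the number of disallowed tuples from a and t. The function names 
are hypothetical, and rounding to the nearest integer is an assumption of this 
sketch rather than a documented property of any particular generator. 

```python
def expected_constraints(n, p):
    """Expected number of constraints: p times the number of variable pairs."""
    return p * n * (n - 1) / 2


def disallowed_tuples(a, t):
    """Number of 0 entries in an a x a constraint matrix of tightness t."""
    return round(t * a * a)


# 20 variables with density 0.5: 190 pairs, so 95 constraints on average.
print(expected_constraints(20, 0.5))
# Domain size 5 and tightness 0.32: 8 of the 25 tuples are disallowed.
print(disallowed_tuples(5, 0.32))
```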

In order to investigate the effects of interchangeability on the performance 
of search for solving CSPs, we must guarantee from the outset that each CSP 
instance contains a specific, controlled amount of interchangeability. 
Interchangeability within the problem instance is determined by the constraints. 
Indeed, each constraint fragments the domain of the variable to which it applies 
into equivalence classes (as discussed below, Fig. 3) that can be exploited for 
bundling. Therefore, the main difficulty in generating a CSP for testing 
bundling algorithms resides in the generation of the constraints. We introduce 
an additional parameter to our random generator [14] that controls the number of 
equivalence classes induced by a constraint. This parameter, IDF, provides a 
measure of the interchangeability in a problem: a higher IDF means less 
interchangeability. In compliance with common practices, our generator adopts 
the following standard design decisions: (1) All variables have the same domain 
size and, without loss of generality, the same values. (2) Any particular pair 
of variables has only one constraint. (3) All constraints have the same degree 
of induced domain fragmentation. (4) All constraints have the same tightness. 
And (5) any two variables are equally likely to be connected by a constraint. 

^3 LD-MB for finding all solutions collapses to DLD. Because of their poor 
behavior [2,4], we exclude from our current experiments: (1) Max-Bundle for 
finding a first solution and (2) all dynamic strategies for variable-value 
orderings (e.g., promise and Max-Bundle) for finding all solutions. 

3.1 Constraint Representation and Implementation 

A constraint that applies to two variables is represented by a binary matrix 
whose rows and columns denote the domains of the variables to which it applies. 
The ‘1’ entries in the matrix specify the tuples that are allowed and the ‘0’ 
entries the tuples that are disallowed. Fig. 3 shows a constraint c, with a = 5 
and t = 0.32. This constraint applies to $V_1$ and $V_2$ with domains 
{1, 2, 3, 4, 5}. The matrix is implemented as a list of row-vectors. Each row 
corresponds to a value in the domain of $V_1$. Each constraint partitions the 
domains of the variables to which it applies into equivalence classes. The 
values in a given equivalence class of a variable are consistent with the same 
set of values in the domain of the other variable. Indeed, c fragments the 
domain of $V_1$ into three equivalence classes corresponding to rows {1, 3, 4}, 
{2} and {5}, as shown in Fig. 3. 

                 V2 
             1 2 3 4 5 
          1  1 1 0 0 1      row1  [11001] 
          2  1 0 0 1 1      row2  [10011] 
    V1    3  1 1 0 0 1      row3  [11001] 
          4  1 1 0 0 1      row4  [11001] 
          5  1 1 1 1 1      row5  [11111] 

Fig. 3. Constraint representation as a binary matrix. Left: Encoding as row 
vectors. Right: Domain of $V_1$ partitioned by interchangeability. 
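
The partition described above can be recomputed mechanically by grouping 
identical row vectors. The following Python sketch does so for the matrix of 
Fig. 3; the helper names are hypothetical and chosen only for this example. 

```python
from collections import defaultdict

# The constraint c of Fig. 3: rows index values of V1, columns values of V2.
FIG3 = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
]


def equivalence_classes(matrix):
    """Group row numbers (starting at 1) whose row vectors are identical."""
    classes = defaultdict(list)
    for number, row in enumerate(matrix, start=1):
        classes[tuple(row)].append(number)
    return list(classes.values())


def idf(matrix):
    """Degree of induced domain fragmentation: number of distinct rows."""
    return len(equivalence_classes(matrix))


def tightness(matrix):
    """Fraction of disallowed tuples (0 entries) in the matrix."""
    cells = sum(len(row) for row in matrix)
    zeros = sum(row.count(0) for row in matrix)
    return zeros / cells


print(equivalence_classes(FIG3))  # [[1, 3, 4], [2], [5]]
print(idf(FIG3))                  # 3
print(tightness(FIG3))            # 0.32
```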

We define the degree of induced domain fragmentation (IDF) of a constraint 
as the number of equivalence classes it induces on the domain of the variable 
whose values index the rows of the matrix. Thus the degree of induced domain 
fragmentation of c for $V_1$ is IDF = 3. Since we control the IDF for only one of the 
variables (the one represented in the rows), our constraints are not a priori sym- 
metrical. The domain fragmentation induced on the remaining variable is not 
controlled. Our generator constitutes an improvement of the random generator 
with interchangeability of Freuder and Sabin [8], which inspired us. The latter 
creates each constraint from the conjunction of two components: one component 
controlling interchangeability and the other controlling tightness. The compo- 
nent controlling interchangeability is both symmetrical and non-reflexive (i.e., 
all diagonal entries in the matrix are 0). Therefore, both variables in a binary 
constraint have the same degree of induced domain fragmentation. This sym- 
metry may affect the generality of the resulting constraint. Indeed, Freuder and 
Sabin introduce the second component of their constraint in order to achieve 
more generality and to control constraint tightness. This second component is a 
random constraint with a specified tightness t. The resulting constraint obtained 
by making the conjunction of the two components is likely to be tighter than 
specified and also contain less interchangeability than specified. To avoid this 
problem, our generator first generates a constraint with a specified tightness, 
imposes the degree of IDF requested, then checks the resulting tightness. Thus, 
in the CSPs generated, we guarantee that both t and IDF meet the specifications 
without sacrificing generality. 



3.2 Constraint Generation 

Constraint generation is done according to the following five-step process: 

Step 1: Matrix initialization. Create an a x a matrix with every entry set to 1. 

Step 2: Tightness. Set random elements of the matrix to 0 until the specified 
tightness is achieved. 

Step 3: Interchangeability. Modify the matrix to comply with the specified 
degree of induced domain fragmentation, see below. 

Step 4: Tightness check. Test the matrix. If tightness meets the specification, 
continue. Otherwise, throw this matrix away and go to Step 1. 

Step 5: Permutation. Randomly permute the rows of the generated matrix. 

When C constraints have been successfully generated 
($C = p\,\frac{n(n-1)}{2}$), each constraint is assigned to a distinct random 
pair of variables. Note that we do not impose any structure on the generated CSP 
other than controlling the IDF in the definition of the constraints. We also do 
not guarantee that the CSP returned is connected. However, when $C > n - 1$, 
extensive random checks detected no unconnected CSPs among the ones generated. 
Obviously, when $C > \frac{(n-1)(n-2)}{2}$, connectedness is guaranteed. Below, 
we describe in further detail Steps 3 and 5 of the above process. Steps 1, 2, 
and 4 are straightforward. 
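
A connectedness check of the kind mentioned above can be sketched with a 
union-find structure over the variables. This is only an illustration: the 
function names are hypothetical, and the sampling of variable pairs is an 
assumption about how constraints might be assigned. The example relies on the 
standard graph-theoretic fact that a graph on n nodes with more than 
(n-1)(n-2)/2 edges must be connected. 

```python
import random
from itertools import combinations


def is_connected(n, edges):
    """Union-find test: do the edges link all n variables together?"""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    components = n
    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            components -= 1
    return components == 1


def random_constraint_graph(n, c, rng):
    """Assign c constraints to distinct random pairs of the n variables."""
    return rng.sample(list(combinations(range(n), 2)), c)


n = 6
threshold = (n - 1) * (n - 2) // 2  # 10 pairs can still leave one node isolated
edges = random_constraint_graph(n, threshold + 1, random.Random(0))
print(is_connected(n, edges))  # True for any sample of this size
```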

Step 3: Achieving the degree of induced domain fragmentation (IDF). 

After generating a matrix with a specific tightness, we compute its IDF by 
counting the number of distinct row vectors. Each vector is assigned to a 
particular induced equivalence class. In the matrix of Fig. 3, row1, row3 and 
row4 would be assigned to equivalence class 1, row2 to equivalence class 2, 
and row5 to equivalence class 3. When the value of IDF requested is different 
from that of the current matrix, we modify the matrix to increase or decrease 
its IDF by one, as discussed below, until meeting the specification. To 
increase IDF, we select any row from any equivalence class that has more than 
one element and make it the only element of a new equivalence class. This is 
done by randomly swapping distinct bits in the vector selected until obtaining 
a vector distinct from all other rows. Note that this operation does not modify 
the tightness of the constraint. To decrease IDF, we select a row that is the 
only element of an equivalence class and set it equal to any other row. For 
example, in Fig. 3, setting row2 ← row5 decreases IDF from 3 to 2. This 
operation may affect tightness. When this is complete, Step 4 verifies that the 
tightness of the constraint has not changed. If it has, we start over again, 
generating a new constraint. If the tightness is correct, we proceed to the 
following step, row permutation. 

Step 5: Row permutation. To increase our chances of generating random 
constraints, and to prevent the fragmentation induced by one constraint on the 
domain of a variable from coinciding with that induced by another constraint, 
the rows of each successfully generated constraint are permuted. The permutation 
process chooses and swaps random rows a random number of times. The input and 
output matrices of this process have the same tightness and interchangeability, 
since swapping rows changes neither the number of zeros nor the set of distinct 
rows. 
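
The five steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the bounds on the number of bit shuffles and IDF adjustments, and the rounding of t·a² to a whole number of zeros are hypothetical choices made for the sketch. The limit of 50 attempts is the one mentioned in Section 3.3.

```python
import random


def idf(matrix):
    """Induced domain fragmentation: the number of distinct row vectors."""
    return len({tuple(row) for row in matrix})


def zero_count(matrix):
    """Number of 0 entries; tightness is zero_count / a**2."""
    return sum(row.count(0) for row in matrix)


def increase_idf(matrix, rng, max_shuffles=100):
    """Move one row of a multi-row class into a new singleton class.

    Shuffling the bits of the row keeps its number of zeros, so tightness
    is unchanged. Returns False if no suitable row or shuffle was found.
    """
    rows = [tuple(row) for row in matrix]
    candidates = [i for i, row in enumerate(rows) if rows.count(row) > 1]
    if not candidates:
        return False
    i = rng.choice(candidates)
    for _ in range(max_shuffles):  # bound is an assumption of this sketch
        shuffled = list(matrix[i])
        rng.shuffle(shuffled)
        if tuple(shuffled) not in rows:  # distinct from every existing row
            matrix[i] = shuffled
            return True
    return False


def decrease_idf(matrix, rng):
    """Overwrite a singleton-class row with a copy of another row.

    This may change the number of zeros, which Step 4 checks afterwards.
    """
    rows = [tuple(row) for row in matrix]
    singles = [i for i, row in enumerate(rows) if rows.count(row) == 1]
    if not singles or len(matrix) < 2:
        return False
    i = rng.choice(singles)
    j = rng.choice([k for k in range(len(matrix)) if k != i])
    matrix[i] = list(matrix[j])
    return True


def generate_constraint(a, t, target_idf, rng=None, max_attempts=50):
    """Return an a x a 0/1 matrix with the requested tightness and IDF,
    or None if every attempt fails."""
    if rng is None:
        rng = random.Random()
    zeros = round(t * a * a)
    for _ in range(max_attempts):
        # Steps 1 and 2: all-ones matrix, then random entries set to 0.
        matrix = [[1] * a for _ in range(a)]
        for cell in rng.sample(range(a * a), zeros):
            matrix[cell // a][cell % a] = 0
        # Step 3: move the IDF one unit at a time toward the target.
        for _ in range(4 * a):  # bound is an assumption of this sketch
            current = idf(matrix)
            if current == target_idf:
                break
            adjust = increase_idf if current < target_idf else decrease_idf
            if not adjust(matrix, rng):
                break
        # Step 4: discard the matrix if tightness or IDF is off.
        if idf(matrix) != target_idf or zero_count(matrix) != zeros:
            continue
        # Step 5: swap random pairs of rows a random number of times.
        for _ in range(rng.randint(1, a * a)):
            i, j = rng.sample(range(a), 2)
            matrix[i], matrix[j] = matrix[j], matrix[i]
        return matrix
    return None


if __name__ == "__main__":
    constraint = generate_constraint(a=5, t=0.32, target_idf=3)
    if constraint is not None:
        for row in constraint:
            print(row)
```

Comparing integer zero counts rather than floating-point tightness values avoids rounding problems in the Step 4 check. A return value of `None` corresponds to a failed generation.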



3.3 Constraint Generation in Action 

An example of this 5-step process is shown in Figure 4, where we generate a 
constraint for a = 5, IDF = 3 and t = 0.32. Note that Step 3 and Step 4, which 
control the interchangeability and tightness of a matrix, may fail to terminate 
successfully. This happens when: (1) no solution exists for the combination of 
the input parameters (it is easy to check that when a = 5 and t = 0.04, only 
solutions with IDF = 2 exist, due to the presence of only one 0 in the matrix); 
or (2) a solution may exist, but the process of modifying interchangeability in 
the matrix continuously changes tightness. To avoid entering an infinite loop in 
either of these situations, we use a counter at the beginning of the process of 
constraint generation. After 50 attempts to generate a constraint, it times out, 
and the generation of the current CSP is interrupted. Our current implementation 
of the generator exhibits a failure rate below 5%, and guarantees constraints 
with both the specified tightness and degree of induced domain fragmentation. 

Fig. 4. Constraint generation process. 
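
The infeasible case in (1) can be checked by brute force. The short sketch below, with hypothetical function names, enumerates every a × a matrix that contains exactly one 0 (tightness 1/25 = 0.04 when a = 5) and collects the IDF values that occur.

```python
def idf(matrix):
    """Induced domain fragmentation: the number of distinct row vectors."""
    return len({tuple(row) for row in matrix})


def idf_values_with_one_zero(a):
    """Set of IDF values over all a x a 0/1 matrices with exactly one 0."""
    values = set()
    for r in range(a):
        for c in range(a):
            matrix = [[1] * a for _ in range(a)]
            matrix[r][c] = 0
            values.add(idf(matrix))
    return values


print(idf_values_with_one_zero(5))  # {2}
```

The row holding the single 0 forms one equivalence class and the remaining all-ones rows form the other, so no other IDF value can be reached at this tightness.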



4 Tests and Results 

We generated two pools of test problems using our random generator, each with a 
full range of values for IDF, t, and p, and 20 instances per measurement point. 
The first pool has the following input parameters: n = 10, a = 5, p = [0.1, 1.0] 
with a step of 0.1, IDF = 2, 3, 4, 5, and t = [0.04, 0.92] with a step of 0.08. 
The second pool has the input parameters: n = 10, a = 7, p = [0.1, 0.9] with a 
step of 0.2, IDF = 2, 3, ..., 7, and t = [0.04, 0.92] with a step of 0.16. Recall 
that when p is small, the CSP is not likely to be connected, and when p = 1, the 
CSP is a complete graph. Note that instances with IDF = a have no embedded 
interchangeability and thus provide the most adverse conditions for bundling 
algorithms. We tested the strategies of Table 1 on each of these two pools, and 
took the averages and the median values of the number of nodes visited (NV), 
constraint checks (CC), size of the bundled solution (when finding the first 
bundle), number of solution bundles (when finding all solutions), and CPU time. 
We report here only the average values since the medians are qualitatively 
equivalent. 

Fig. 5. Comparing performance of search for finding one solution, t = 0.28. 

Constraint tightness has a large effect on the solvability of a random problem. 
Problems with loose constraints are likely to have many solutions. As tightness 
grows, the values of all measured parameters (CC, NV, CPU time, and bundle size) 
quickly drop to zero because almost all problems become unsolvable (especially 
for t > 0.5). The behavior of the various algorithms is most visible at 
relatively low values of tightness. In Fig. 5 and Fig. 6 we display charts for a 
tightness of t = 0.28 with the second problem pool, where each variable has a 
domain size of 7 (a = 7). The patterns observed on this data set are consistent 
across all tightness values for both problem pools; the other charts are not 
reported here for lack of space. Both figures show that the algorithms are 
affected by the increasing IDF. This effect is more visible in Fig. 6. This 
demonstrates that our generator indeed allows us to control the level of 
interchangeability. 



4.1 Finding the First Bundle 

In our experiments for finding the first solution bundle, we report in Fig. 5 the 
charts for CC (left) and bundle size (right). Note the logarithmic scale of both 
charts. The chart for CPU time is similar to that of CC and is not shown. 

Three of the four DNPI-based searches (DNPIpromise is the exception) 
reside toward the bottom of Fig. 5, showing that DNPI performs better than NIc 
in terms of the search effort measured as CC (left), CPU time (not shown), and 
size of the first bundle (right). DNPI also seems more resistant than NIc to an 
increasing IDF. Even in the absence of embedded interchangeability (large IDF) 
and when density is high (large p), DNPI-based strategies still perform some 
bundling (bundle size > 1). 

At the left of Fig. 5, FC is shown slightly below DNPI at the bottom of the 
chart. One is tempted to think that FC outperforms all the bundling algorithms. 
However, recall that FC finds only one solution, while DNPI finds from 5 up to 
one million solutions per bundle for a cost of at most 5 times that of FC. 
Furthermore, DNPI not only finds a multitude of solutions, but the similarity 
of these solutions makes them particularly desirable in practical applications 
for updating solutions. 

4.2 Finding All Solutions 

The effects of increasing IDF are more striking when finding all solutions and are 
shown in Fig. 6. It is easy to see in all four charts of Fig. 6 that both static (NIc) 
and dynamic (DNPI) bundling searches naturally perform better where there is 
interchangeability (low values of IDF) than when there is not (IDF approaches a). 
However, this behavior is much less drastic for DLD-based searches, which are less 
sensitive to the increase of IDF than SLD-based searches. Indeed, the curves for 
DLD (both NIc and DNPI) rise significantly more slowly than their SLD counterparts 
as the value of IDF increases. Additionally, we see here more clearly than reported 
in [2] that search with DLD outperforms search with SLD for all evaluation criteria 
and for all values of p and IDF. 

From this data, one is tempted to think that the problems with high inter- 
changeability (e.g., IDF = 2) are easier to solve in general than those with higher 
values of IDF. This is by no means the case. Our experiments have shown that 
non-bundling FC is not only insensitive to interchangeability, but also performs 
consistently several orders of magnitude worse than DNPI and NIc. This data is 
not shown because it is 3 to 7 orders of magnitude larger than the other values. 

Even when interchangeability was specifically not included in a problem 
(IDF = a), all bundling strategies, more significantly dynamic bundling, were 
able to bundle the solution space. This is because, as search progresses, some 
values are eliminated from domains, and thus more interchangeability may become 
present. This establishes again the superiority of dynamic bundling even in the 
absence of explicit interchangeability: it runs far faster than FC, and its 
bundling capabilities are clear. 



5 Conclusions and Directions for Future Research 

In this paper we describe a generator of random binary CSPs that allows us 
to embed and control the structure, in terms of interchangeability, of a CSP 
instance. We then investigate the effects of the level of interchangeability on the 
performance of forward-checking search strategies that perform no bundling 
(FC) and that exploit static (NIc) and dynamic (DNPI) bundling. These strategies 
are combined with the most common or best performing ordering heuristics. 

Fig. 6. Comparing performance of search for finding all solutions, t = 0.28. 



We demonstrate that dynamic bundling strategies remain effective across all 
levels of interchangeability, even under particularly adverse conditions (i.e., IDF 
= domain size). While search with either static or dynamic ordering is able to 
detect and exploit the structure embedded in a problem, DLD-ordered search is 
less sensitive to the absence of interchangeability, performing quite well in all 
situations. In particular, we see that DNPI-DLD reacts slowly to the presence 
or absence of interchangeability while performing consistently well for finding 
either one or all solutions. 

We intend to extend these investigations to non-binary CSPs and also to 
demonstrate that dynamic bundling may benefit from maintaining arc-consistency 
(MAC) of Sabin and Freuder [13]. Additionally, the flatness of the curves for 
DNPI in Fig. 5 (left) makes us wonder how search strategies based on bundling 
may be affected by the famous phase-transition phenomenon. 
Abstract 

Autonomous robots are unsuccessful at operating 
in complex, unconstrained environments. They 
lack the ability to learn about the physical be- 
haviour of different objects. We examine the via- 
bility of using qualitative spatial representations to 
learn general physical behaviour by visual observa- 
tion. We combine Bayesian networks with the spa- 
tial representations to test them. We input training 
scenarios that allow the system to observe and learn 
normal physical behaviour. The position and ve- 
locity of the visible objects are represented as dis- 
crete states. Transitions between these states over 
time are entered as evidence into a Bayesian net- 
work. The network provides probabilities of future 
transitions to produce predictions of future physi- 
cal behaviour. We use test scenarios to determine 
how well the approach discriminates between nor- 
mal and abnormal physical behaviour and actively 
predicts future behaviour. We examine the ability 
of the system to learn three naive physical concepts: 
‘no action at a distance’, ‘solidity’ and ‘movement 
on continuous paths’. We conclude that the combination 
of qualitative spatial representations and 
Bayesian network techniques is capable of learning 
these three rules of naive physics. 

1 Introduction 

The AI community has been unable to create successful, au- 
tonomous robots able to operate in complex, unconstrained 
environments. The main reason for this has been the in- 
ability of an agent to reason with Commonsense knowl- 
edge [McCarthy, 1968; Dreyfus, 1992]. A subset of Com- 
monsense knowledge is the body of knowledge known as 
Naive Physics [Hayes, 1978; 1984; 1985]. This is the ability 
to learn and reason about the physical behaviour of objects. 
Humans generate rules about the physical behaviour of ob- 
jects in the real world when they are infants [Bryant, 1974; 
Piaget and Inhelder, 1969; Vurpillot, 1976]. These rules are 
not as accurate as the laws of physics that one learns in school 
and in fact are often quite inaccurate. For this reason, the 
body of physical rules generated by humans is called Naive 
Physics. Formal physics can only be learned through empirical 
experiments or education using high level concepts, such 
as forces and energy. It cannot be learned by visual observation 
alone, whereas most aspects of naive physics are learned 
by every normal human child by the age of 2. 

Our research examines the task of learning naive physics by 
visual observation by combining both qualitative spatial rep- 
resentation and Bayesian networks. 

In particular, our specific research questions are: 

1. ‘How viable is the use of qualitative spatial representations 
for reasoning in a dynamic, spatial domain?’ 

2. ‘What is the critical sampling frequency for successful 
learning and prediction for this approach?’ 

3. ‘Which physical concepts can be learned using this ap- 
proach?’ 

Our contribution is significant for the following reasons: 

• It covers a dynamic, spatial domain 

• It confirms the viability of qualitative spatial representa- 
tions in this domain 

• It confirms the viability of probabilistic reasoning in this 
domain 

• It is the first approach to learning general physical be- 
haviour in a dynamic, spatial domain 

• The knowledge base is learned entirely from observa- 
tion. No a priori knowledge is used 

2 Related Work 
2.1 Naive Physics 

Human developmental psychology research [Bryant, 1974; 
Vurpillot, 1976] shows that infants develop naive physical 
concepts to understand the physical world. The goal of 
setting up a committee to build a naive physics knowledge 
base for AI use was first proposed twenty-two years ago in 
the Naive Physics Manifesto [Hayes, 1978]. This committee 
was never formed and twenty years later Davis [Davis, 
1998] contends that it would be difficult to find more than 
12 researchers who have pursued this type of research. One 
problem he lists is that physics relies heavily on spatial 
knowledge, which is difficult to express and represent. 
Furthermore, we do not have good, tractable algorithms 
to handle dynamic, spatial reasoning. So far, no general 
knowledge base of naive physics exists. 

2.2 Qualitative Spatial Reasoning 

Qualitative Spatial Reasoning (QSR) [Cohn, 1995; Hernandez, 
1994; Cohn, 1997] is a reasonably recent development 
within AI. It avoids quantitative measurement in its attempts 
to define the physical world in discrete regions labelled with 
qualitative terms.¹ This grouping of space into discrete regions 
greatly reduces the complexity of spatial reasoning. 
There are an infinite number of ways to carve up space 
into discrete regions. See Mukerjee [Mukerjee, 1997] for 
a summary of the different approaches for defining space. 
There are also a large number of other physical attributes 
that may be included in a qualitative spatial representation, 
including orientation [Zimmermann, 1993], shape [Schlieder, 
1994], motion [Muller, 1998; Musto et al., 1998; Rajagopalan 
and Kuipers, 1994] and even the behaviour of flexible bodies 
[Gardin and Meltzer, 1989]. 

Qualitative temporal representations have transitive properties 
due to natural ordering in one dimension [Allen, 
1983]. Unfortunately, these are not carried over into three-dimensional 
space. Forbus, Nielsen and Faltings' Poverty 
Conjecture [Weld and DeKleer, 1990] contends there is no 
powerful, general, purely qualitative spatial representation for 
more than one dimension. This would appear to limit the potential 
use of qualitative spatial representations in spaces of 
2 or more dimensions. 

Randell, Cui and Cohn [Randell et al., 1992] established the 
benchmark RCC8² calculus for static, topological worlds. 
RCC8 does not cater for dynamic factors, such as velocity, 
which are fundamental to physical behaviour. Furthermore, 
Renz and Nebel [Renz and Nebel, 1999] have shown reasoning 
in RCC8 to be NP-hard.³ 

Given that RCC8-based spatial reasoning is intractable for 
static, 2-dimensional worlds, we believe that another ap- 
proach is required for reasoning in dynamic, 3D worlds. 
When mapping RCC8 to a 3D world where objects cannot 
share the same space, the 8 states can be reduced to 2 relevant 
states, DC (Disconnected) and EC (Externally Connected or 
touching). These 2 states are the basis of the QSR types used 
in our work. 

Almost all reasoning in the field of Qualitative Spatial Reasoning 
is based on logical inferencing. This is because QSR 
uses a discrete, expressive representation, ideal for logical 
inferencing. However, the real world is inaccessible. We believe 
any agent operating in it must be able to reason with 
missing or ambiguous evidence and uncertainty. Logical reasoning 
does not perform as well as certain types of probabilistic 
reasoning under these conditions. Probabilistic reasoning 
with qualitative spatial representations is an unexplored 
field. We examine the probabilistic reasoning approach. At 
this point our work diverges from the existing literature. 

¹ Instead of stating that a keyboard is 0.37 metres North of me, 
0.12 metres West, which is a quantitative description of its location, 
QSR defines location in qualitative terms such as, the keyboard is 
‘in-front-of’ me, ‘on-the-desk’. 

² Named after the authors, Randell, Cohn and Cui, but often 
referred to as the Region-Connected Calculus. 

³ Reasoning in both RCC8 and RCC5 is NP-hard in modal logic. 
They did identify a maximal, tractable set of relations through a 
transformation to propositional logic. 

With the exception of a few researchers [Muller, 1998; 
Musto et al., 1998], the majority of problem domains using 
QSR are static. The complexity of logical reasoning in static, 
spatial domains means that dynamic, spatial domains are still 
out of reach. Probabilistic tools may allow us to avoid some 
complexity issues in handling dynamic, spatial domains. This 
reduced complexity may allow us to extend beyond static do- 
mains to more realistic dynamic domains. 

QSR representation also has the alleged advantage that it parallels 
the world representation models used by children under 
6,⁴ making it a suitable candidate for naive physics reasoning. 

2.3 The Tabula Rasa Approach 

Work on naive physics and commonsense has been centred 
around large, hand-generated knowledge bases, such 
as CYC [Lenat et al., 1990; Lenat and Guha, 1990; Lenat, 
1995]. These have proved difficult and expensive to create 
and have not yet achieved the desired results. Both René 
Descartes and Bertrand Russell contended that human concepts 
are built up from a foundation of simpler concepts. We believe 
that a successful approach must be able to learn from 
observation, starting with a blank slate. 

2.4 Bayesian Networks 

Bayesian networks are a powerful probabilistic reasoning 
tool [Cheeseman, 1988; D’Ambrosio, 1999; Russell and 
Norvig, 1995] and have the characteristic of being able to 
handle uncertain, conflicting and missing evidence [Jensen, 
1996]. Their performance in reasoning in these domains is 
better than that of logical reasoning. This makes them a suitable 
probabilistic reasoning tool, and we use them for this work. 

2.5 Event Detection 

Detection of events is the key to learning the rules of naive 
physics. [Tsotsos et al., 1980] was a very early example of 
the use of qualitative methods in a spatial domain. Fernyhough, 
Cohn and Hogg [Fernyhough et al., 1998] developed 
a qualitative vision system to identify traffic behaviour. This 
is one of the few recent attempts to use qualitative techniques 
for event detection observed through computer vision. The 
system focused on a very narrow domain but highlighted the 
potential of using discrete or qualitative spatial representations. 

Siskind [Mann et al., 1996; Siskind and Morris, 1996] developed 
a system using probabilistic reasoning tools for dynamic, 
spatial reasoning. Their system identifies 6 pre-defined, 
human interaction events. The 6 events are push, 
pull, drop, throw, put down and pick up. Hidden Markov 
models (HMM) are used to provide a highest likelihood result 
to detect and classify an event. This use of probabilistic 
reasoning methods is the major similarity to our work. In contrast, 
Siskind’s 6 events are at an ontologically higher level 
than ours and use a priori knowledge to define the events. 
Because of our Tabula Rasa approach, we believe our lower, 
foundational level is a more appropriate basis for a scalable 
general physical reasoning system. Furthermore, we suspect 
that Siskind’s work is not expandable without adding code for 
each individual event to be detected. Our work uses no a priori 
knowledge of physical behaviour and learns from a blank 
slate. This gives our approach the potential to learn equally 
well in a world with vastly different physical behaviour, such 
as under water or in space. 

⁴ Up to 4 years of age, the human child appears to internally 
represent the world in a qualitative topological form where shape 
and distance are poorly represented. Later, a projective model is 
used, which can represent shape and relative position but is weak 
on measurement and scale. At the age of 10 to 12 the child attains 
a Euclidean model with full scale, measurement and vector abilities 
[Bryant, 1974; Piaget and Inhelder, 1969; Vurpillot, 1976]. 

3 Combining QSR and Probabilistic Reasoning 

3.1 The Scenarios 

The system uses simulated, animated scenarios of billiard ball 
motion for training and testing. Simulated scenarios are used 
to avoid the implementation problems of a computer vision 
system. The animated scenarios are generated from planar 
camera views of objects moving in a 3D virtual world. In this 
case, a billiard table is simulated. Because of the planar 
viewpoint, all the scenarios used for this work are accessible and 
appear similar to 2D animations. Later work will investigate 
fully 3D scenarios. The scenarios capture a large range of 
examples of typical physical behaviour. There are 800 scenarios, 
each typically having 400 frames. Three typical (non-sequential) 
frames of one scenario, showing a collision, are 
shown in figure 1. 

The position and velocity of all objects in the scenario are 
represented as discrete states, the amount of information be- 
ing determined by the QSR type. 

3.2 The Qualitative Spatial Representation 

Definitions. A QSR type is defined in this document as the 
way that a spatial relation is represented, usually as a vector 
containing 2 to 6 elements called attributes. We use the term 
‘qualitative spatial representation’ even though all qualitative 
content is stripped from the representation and the state 
of each attribute is represented purely as a number. Each 
attribute has a discrete number of possible states, usually 
between 2 and 8, with 3 being very common [Cohn, 1995]. 
The total number of possible relations that a particular QSR 
type can represent is equal to the product of the number of 
states for each attribute. 

$N_{QSRtype} = N_0 \times N_1 \times N_2 \times N_3 \times \cdots \times N_n$ 

For example, a QSR type with 6 attributes, each with 3 
possible states, could represent 729 different spatial/motion 
relations ($3^6$). 
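
The product rule is easy to check numerically. The sketch below uses a hypothetical helper, `relation_count`, and the attribute layouts passed to it are illustrative examples rather than definitions taken from the paper.

```python
from math import prod


def relation_count(states_per_attribute):
    """Total number of relations: product of the per-attribute state counts."""
    return prod(states_per_attribute)


print(relation_count([3] * 6))       # 729, the example in the text
print(relation_count([3, 3, 3, 3]))  # 81
print(relation_count([5, 5, 3, 3]))  # 225
```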

Fig. 1. 3 frames from a typical scenario (arrows added to 
indicate velocity) 

Hernandez [Hernandez, 1994], citing Freksa and Röhrig, lists the 
different dimensions used to classify different QSR types. The 
following list locates our QSR types within this framework: 

• Frame Of Reference - We use locally-aligned Cartesian 
and polar coordinates. 

• Representational Primitives - Our work is based on re- 
gions. 

• Spatial Aspects Represented - We examine a topological 
QSR with the addition of velocity. 

• Granularity - We use low resolution QSR types. 

• Vagueness - Our QSR types have no ambiguity. All 
states are mutually exclusive. 

Four different QSR types are examined, 2 using a Cartesian 
representation and 2 using a polar representation. Because the 
scenarios are planar, the QSR types are only 2-dimensional. 
QSRO1 and QSRO2 use an orthogonal (Cartesian) format 
and have 81 and 225 possible states, respectively. For both 
QSRO1 and QSRO2, the x and y dimensions are independent. 
These QSR types have 4 attributes: 2 attributes represent 
position in 2 dimensions, x and y, and 2 attributes represent the 
respective velocities, Vx and Vy. The attributes have either 3 
or 5 possible states. QSRO2 is shown in figure 2. 

QSRP1 and QSRP2 are based on polar coordinates and have 
384 and 960 possible states, respectively. Because of the nature 
of polar coordinates, both x and y dimensions are coupled. 

Fig. 2. QSR type QSRO2, representing position and velocity in 225 states 

The maximum number of transitions, shown in table 1, is the 
square of the number of possible states. That is, in a random 
world, each state could transition to any other state, including 
itself. Given the ordered characteristic of the physical world, 
the number of transitions that one would expect to observe 
is considerably less, approximately 3-8% of the maximum 
number possible. This is discussed in the next section. 



Table 1. Different QSR types tested 

QSR Type   Number of States   Maximum number of transitions   Expected number of transitions 
QSRO1                    81                           6,561                              576 
QSRO2                   225                          50,625                            2,304 
QSRP1                   384                         147,456                           10,752 
QSRP2                   960                         921,600                           53,760 
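
The maximum column of Table 1 can be reproduced by squaring the state counts, and the share of expected transitions follows by division. The sketch below is only a check on the table's arithmetic; the dictionary and function names are hypothetical.

```python
# (number of states, expected number of transitions), copied from Table 1
TABLE_1 = {
    "QSRO1": (81, 576),
    "QSRO2": (225, 2304),
    "QSRP1": (384, 10752),
    "QSRP2": (960, 53760),
}


def max_transitions(states):
    """Every state may move to every state, including itself."""
    return states * states


def expected_share(states, expected):
    """Expected transitions as a fraction of the theoretical maximum."""
    return expected / max_transitions(states)


for name, (states, expected) in TABLE_1.items():
    share = expected_share(states, expected)
    print(f"{name}: max = {max_transitions(states)}, expected share = {share:.1%}")
```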



3.3 Complexity of QSR Composition Tables in Dynamic Domains 

The natural ordering of QSR states [Cohn, 1995; Hernandez, 
1994] greatly reduces the number of transitions from one 
state to another that should be observed. For QSRO2, which 
has 225 possible states, the actual number of observed transitions 
is only 2,300 instead of the 50,000 theoretical maximum. 
This characteristic of almost all QSR types is due to the 
fact that objects move along continuous curves in our physical 
world. Any change in position, when measured on a discretized 
scale, tends to be either an increment or a decrement 
of the previous position. This characteristic of QSR is very 
important for work in dynamic domains. There does not appear 
to be a consensus on the naming of this characteristic, 
so we adopt the term ‘transitive characteristic’ based on the 
transitivity of ordering mentioned by Cohn [Cohn, 1995]. 



3.4 Transitivity Criterion 

Because of the importance of this transitive characteristic, it 
is essential to use an image sampling rate that is fast enough 
compared to both the resolution of the QSR and the speed of 
the objects. There is a linear relationship between the sample 
rate, the object speed and resolution of a QSR type if the 
transitive characteristic is to be satisfied. 

$$\text{sample frequency}\ (\mathrm{s}^{-1}) \;\geq\; \frac{\text{object speed}\ (\mathrm{m\,s}^{-1})}{\text{minimum QSR resolution}\ (\mathrm{m})}$$ 

If the sample frequency is too low or the resolution or 
object speed are too high, the observed object may pass 
through several states between observation samples. If it 
becomes possible for the object to move from any one state 
to any other state between sample times then the number of 
observable transitions climbs quickly. The problem deteriorates 
to an intractable level of complexity and probabilistic 
reasoning is degraded. For this reason, maintaining the 
transitive characteristic is very important for QSR work 
that involves motion. Because of the importance of this 
characteristic, we define it as the transitivity criterion. 
Unfortunately, velocity (and acceleration) attributes do not 
satisfy the transitivity criterion at normal sampling frequencies. 
Whilst position appears to change transitively, velocity 
appears to be discontinuous at normal sampling frequencies,⁶ 
typically 5-25 Hertz. For example, a ball may appear to 
change direction instantly as a result of a ‘bounce’ observed 
at normal sampling frequencies. Each position attribute 
usually has 3 expected next states, either an increment, a 
decrement or no change, whereas a velocity attribute could 
change to any value. This means we must monitor the 
potential combinatorial growth that may result from using 
QSR types that have high resolution velocity attributes. 
Therefore, there is a strong motivation for reducing the 
number of position attributes and increasing the resolution 
of those attributes while reducing both the number and 
resolution of velocity attributes. 

⁶ In fact, both velocity and acceleration are continuous and transitive 
for all objects with mass. However, satisfying the transitivity 
criterion for velocity or acceleration attributes requires resolution 
and sampling frequencies an order of magnitude higher than that for 
position attributes. 
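
The relation between sampling frequency, object speed and QSR resolution in this section can be turned into a small calculation. The sketch below reads it as a lower bound on the sampling frequency, which is an interpretation made for this example; the function names and the numeric values are hypothetical.

```python
def min_sample_frequency(object_speed, qsr_resolution):
    """Lowest sampling frequency (Hz) at which an object moving at
    object_speed (m/s) advances at most one region of width
    qsr_resolution (m) between consecutive samples."""
    if qsr_resolution <= 0:
        raise ValueError("qsr_resolution must be positive")
    return object_speed / qsr_resolution


def regions_crossed_per_sample(object_speed, qsr_resolution, sample_frequency):
    """Number of region widths the object travels between two samples.
    A value above 1 means the transitivity criterion is violated."""
    return object_speed / (qsr_resolution * sample_frequency)


# Hypothetical values: a ball at 2 m/s, regions 0.25 m wide.
print(min_sample_frequency(2.0, 0.25))             # 8.0
print(regions_crossed_per_sample(2.0, 0.25, 4.0))  # 2.0
```

At 4 Hz the hypothetical ball covers two regions per sample, so a position attribute could skip a state and would no longer change only by an increment or a decrement.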
3.5 Predicting physical behaviour 

Following the comparison of the 4 different QSR types, we 
examine one typical orthogonal QSR type in depth. We 
determine its capability to represent sufficient spatial and 
motion information to learn three basic rules of naive physics. 



3.6 The Bayesian Network 

Fig. 3. 4 Node Bayesian Network 

In our work, the four-node network shown in Figure 3 is used. 
The network starts out empty. The following four things are 
represented in the network: 

• ‘Reference object (RO) type’, the type of object acting 
as a reference point. For example, type 17 represents a 
billiard ball. 

• ‘Located object (LO) type’, the type of the object to 
which the spatial relation refers. 

• ‘Current relation’, the current state of the relation between 
the located and reference objects, represented in the 
relevant QSR type format. 

• ‘Next relation’, the next state of the relation between the 
located and reference objects, represented in the relevant 
QSR type format. In a Bayesian network that includes 
all observed evidence, the number of states stored in this 
node would equal the number of transitions observed 
within all scenarios. 

This network allows the system to use the object types, both 
reference and located, and their current relation in terms of 
a QSR type for prediction. The position and velocity of the 
objects shown in the scenarios are represented as qualitative 
states. Transitions between these states from frame to frame 
are used as updating evidence. The node probabilities are 
generated by fractional updating [Jensen, 1996]. Each node 
may grow to have many different node states. 

Prediction of future relations is based on the Markov assumption 
that the future state is independent of the past state given 
the present state. By querying the Bayesian network based 
on the present state, the network can provide the probability 
of the next state in terms of the QSR type being used. The 
network thereby provides the probabilities of the future relative 
positions and motions of the located object, and hence, its 
future physical behaviour. It can even do this if some of the 
present state evidence is missing. This information is used 
to build a prediction graph of possible future states. Active 
prediction does a heuristic search of this graph to determine 
the most probable next state. By increasing the number of 
reference objects, the located object’s position is more tightly 
constrained at the expense of higher complexity for the prediction 
graph. We use either 2 or 3 reference objects. 
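
The idea of learning next-state probabilities from observed transitions can be illustrated with a simple frequency table. The sketch below is a deliberate simplification and not the system described here: it uses plain relative-frequency counting instead of fractional updating in a Bayesian network, it cannot handle missing evidence, and the class name, method names and state numbers are hypothetical.

```python
from collections import Counter, defaultdict


class TransitionModel:
    """Counts observed transitions and estimates P(next | RO, LO, current)."""

    def __init__(self):
        # (ro_type, lo_type, current_state) -> Counter of next states
        self.counts = defaultdict(Counter)

    def observe(self, ro_type, lo_type, current, nxt):
        """Record one frame-to-frame transition as evidence."""
        self.counts[(ro_type, lo_type, current)][nxt] += 1

    def predict(self, ro_type, lo_type, current):
        """Return {next_state: probability}; empty if the state is unseen."""
        seen = self.counts.get((ro_type, lo_type, current))
        if not seen:
            return {}
        total = sum(seen.values())
        return {state: n / total for state, n in seen.items()}

    def most_probable(self, ro_type, lo_type, current):
        """Most frequently observed next state, or None if unseen."""
        dist = self.predict(ro_type, lo_type, current)
        return max(dist, key=dist.get) if dist else None


# Hypothetical evidence: object type 17 (a billiard ball) relative to
# another ball; relation state 40 moved to 41 three times and stayed once.
model = TransitionModel()
for _ in range(3):
    model.observe(17, 17, 40, 41)
model.observe(17, 17, 40, 40)

print(model.predict(17, 17, 40))        # {41: 0.75, 40: 0.25}
print(model.most_probable(17, 17, 40))  # 41
```

Conditioning only on the current relation mirrors the Markov assumption stated above. A transition that was never observed receives no probability at all, which is one way such a table could flag abnormal behaviour.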

3.7 The Naive Physics concepts 

There are 7 physical concepts that are learned by the time a 
human infant reaches 18 months of age. We establish 3 of 
these 7 concepts as the goals of our system. They are: 

• ‘No action at a distance’, the concept that objects do not 
affect other objects unless they touch. 

• ‘Solidity’, the concept that no two objects can occupy 
the same space. This leads to the concepts of ‘support’ 
and ‘collision’. 

• ‘Movement on continuous paths’, the concept that ob- 
jects move along continuous curves. 

The other four main, foundational concepts are ‘object per- 
manence’, ‘consistency’, ‘inertia’ and ‘gravity’. 

4 Experimental Results 

Two series of tests were conducted. The first series compared 
4 QSR types, examining the growth in the number of ob- 
served states and observed transitions for the 4 types. The 
purpose of these tests was to confirm that the growth in ob- 
served transitions approached an asymptotic maximum far 
lower than the maximum number of possible transitions. Fur- 
ther, we wanted to demonstrate that this phenomenon applied 
to all the tested QSR types. 

The second series examined the training, testing and predic- 
tion abilities of one of the QSR types when combined with 
a Bayesian network. This series of tests involved learning 
through observation, classifying abnormal behaviours and 
predicting future physical behaviour. The purpose of these tests 
was to demonstrate the viability of combining probabilis- 
tic reasoning and qualitative spatial representations to learn 
naive physical behaviour. 

4.1 Comparing 4 Different QSR Types 

A training set of 800 scenarios was generated for these 
tests. The tests were repeated 4 times, once with each QSR 
type. The system observed all 800 scenarios, using one 
of the selected QSR types to represent the spatial relations 
observed. As more scenarios were observed, both the 
number of observed states (spatial relations) and the number 
of observed transitions between these states increased. After 
viewing the 800 scenarios, the system will have observed 
almost all possible states and up to 900,000 transitions. The 
growth graphs for the 4 QSR types were compared. 

The actual number of observed transitions after 800 scenarios 
is shown in table 2. It can be seen that the higher resolution 
QSR types have a far higher number of expected transitions. 
From figure 4 it can be seen that, for all tested QSR types, 
the number of observed states grew quickly and approached 
a horizontal asymptote equal to the maximum number of 
states. The simpler, lower resolution QSR types achieved 
the maximum number of observed states within a very low 
number of scenarios. For example, QSRO1 observes all 81 
possible states within the first 3 scenarios alone. 

Table 2. Different QSR types tested 

QSR Type    Expected number     Number of transitions 
            of transitions      after 800 scenarios 
QSRO1                  576                   2,542 
QSRO2                2,304                   5,636 
QSRP1               10,752                   6,427 
QSRP2               53,760                  18,715 





Fig. 5. Growth in the number of observed transitions 



The most significant result is that all the transition growth curves 
shown in figure 5 tend strongly towards a horizontal 
asymptote. Furthermore, this asymptote is far below the 
maximum number of possible transitions. This supports the 
contention that QSR types that meet the transitivity criterion 
will have a relatively low number of observed transitions 
compared to the maximum number of possible transitions. 
This keeps the computational requirements for probabilistic 
reasoning low. 

Both the Cartesian QSR types had more than twice as many 
observed transitions as expected. This is because many of 
the transitions that occurred had changes in more than one 
attribute, usually x and y positions. The estimate of the 
expected number of observed transitions is based on only 
one attribute changing state at each transition. If two or more 
attributes can change state together, the number of observed 
transitions can be far greater than the expected number. 

Both the polar QSR types had fewer observed transitions than 
the expected number. There are 2 factors that contribute to 
this. One is that the polar position can only change by an 
increment or a decrement (assuming the transitivity criterion 
is met). The other, in the case of QSRP2, is that the growth 
had not yet fully plateaued and was still increasing after 800 
scenarios. Higher resolution QSR types need a larger number 
of training scenarios to observe all the expected transitions. 

For all QSR types, approximately 40% of the observed 
transitions were observed 5 or fewer times and 17% were 
only observed once. This fact could be used to prune rare 
transitions and further reduce the size of the Bayesian 
network, and hence the computational resources required 
for reasoning. Even with the large number of transitions 
observed, the system is able to make predictions at almost 
real-time speeds on a 233MHz Pentium II PC. 
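
The pruning idea suggested above can be sketched as follows. This is only an illustration: the threshold `min_count`, the nested-dictionary layout and the state names are assumptions, and no pruning is reported as part of the tested system.

```python
def prune_rare_transitions(counts, min_count=2):
    """Drop transitions that were observed fewer than min_count times.

    counts maps a present state to a dict {next state: times observed}.
    """
    pruned = {}
    for present, successors in counts.items():
        kept = {nxt: n for nxt, n in successors.items() if n >= min_count}
        if kept:  # omit states left with no outgoing transitions
            pruned[present] = kept
    return pruned


# Hypothetical observation counts.
observed = {
    "near": {"touching": 12, "far": 7, "overlapping": 1},
    "touching": {"near": 9},
    "far": {"near": 1},
}
pruned = prune_rare_transitions(observed, min_count=2)
```

A higher threshold removes more of the network but risks discarding valid, rarely seen behaviour, so the value would need to be chosen against the detection results.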

4.2 Training for Prediction 

Prior to running the tests, the system was prepared by purg- 
ing the Bayesian network of all data, in effect clearing all 
knowledge of previously learned behaviour. The system was 
then subjected to a training process that involved observing 
75 different training scenarios 14 times each. This exposes 
the system to 1,025 scenario runs. Each of these 75 scenar- 
ios showed a complex interaction of 4 rolling balls bouncing 
off cushions and hitting each other. The scenarios are not in- 
tended to be exhaustive. Our set of 75 training scenarios cov- 
ers 83% of the expected transitions, calculated by comparing 
the number of expected versus actual observed transitions. At 
the completion of training, 400,000 pieces of evidence have 
been processed by the Bayesian Network. 

4.3 Passive and Active Prediction 

Our results revealed that there appear to be two levels of pre- 
diction of which the system is capable. The first one is the 
ability to detect and flag abnormal behaviour. We define this 
as passive prediction. The second level of prediction, called 
active prediction, is the ability to observe the first section of 
a scenario and predict the future motion of the objects 
involved. 

4.4 Test Results 

A series of 58 test scenarios was run. There were 34 ‘normal’ 
scenarios, wherein the behaviour of the objects would be 
subjectively described as normal. There were also 24 ‘abnor- 
mal’ scenarios, wherein the behaviour of the objects would 
be described as abnormal. That is, they were in conflict with 
the rules of naive physics. 

Footnote: It is interesting to note that the standard psychological tests 
for human infants test only for passive prediction abilities [Bryant, 
1974; Vurpillot, 1976]. 
Two passive prediction results were recorded for each test 
scenario. The first was the number of normal scenarios 
determined to be normal (true positives). The second was 
the number of abnormal scenarios that were detected as 
abnormal (true negatives). The results are shown in table 3. 
The prediction graphs created during 23 ‘normal’ scenarios 
were examined for each state to determine the accuracy of 
active prediction. These results are shown in table 4. In 
most cases the system correctly predicted the next state but 
there were few scenarios where the correct state was actively 
predicted for all transitions that occurred in the scenario. For 
this reason, the results show the total number of correct and 
incorrect predictions. 



Table 3. Results for passive prediction 

Scenario Type              Normal scenarios       Abnormal scenarios 
                           accepted as normal     detected as anomalous 
All Objects Stationary     4/4    100%            4/5    80% 
Ball rolling to a stop     8/8    100%            8/8    100% 
Ball bouncing off wall     10/10  100%            5/7    71% 
Ball collides with ball    3/8    37%             4/4    100% 
Multiple collisions        0/4    0%              - 





Overview of results 

Overall the system is able to perform passive prediction very 
well using only probabilistic data. Exceptions specific to 
each scenario type are noted below. 

Active prediction performance is dependent on the QSR 
attribute resolution. The low resolution attributes used for 
this work create low probabilities when predicting some valid 
state changes. This reduces the accuracy of active prediction. 
The most common active prediction (in 52% of the cases) 
was that the system would remain in the same state*. The ‘same 
state’ prediction is not a default prediction. It must also be 
learned. Also shown in table 4 are the number of correct 
and incorrect active predictions excluding ‘same state’ 
predictions. 

In most cases, the limiting aspect is the resolution of the 
QSR type. Future work will investigate which aspects of 
QSR types most affect learning physical behaviour. 

Table 4. Results for active prediction 

                           Correct within QSR     Incorrect 
                           type resolution 
All active predictions     142/184  77%           42/184  23% 
All active predictions     47/89    53%           42/89   47% 
(excluding same state) 

* To avoid the high number of these ‘same state’ predictions 
overwhelming the results, a prediction that a state would remain constant 
for many frames of a scenario was counted as only one correct 
prediction. 

Scenario type 0 

Stationary objects remain stationary: 

The system was able to predict both passively and actively 
that stationary objects will remain stationary. 

It is interesting to note that the system failed to detect one 
abnormality in one of the test scenarios. The abnormality was 
the spontaneous relocation of a reference object. However, 
the object remained in the same QSR region. This is an 
example of where a low resolution QSR type is unable to 
detect anomalous behaviour. 

Scenario type 1 

Rolling object decelerates and stops: 

The system was able to predict both passively and actively 
that rolling objects will continue to roll in the same direction 
until they decelerate to a complete stop. The ‘stop’ event 
was always in the prediction graph but because the tested 
QSR had the coarsest resolution in the velocity attributes, 
the probability of a ‘stop’ was low. For this reason, a ‘stop’ 
was never actively predicted as the most probable next state. 
QSR types with higher resolution velocity attributes should 
perform much better in this regard. 

Scenario type 2 

Rolling ball rebounds off wall: 

The system was able to predict both passively and actively 
that a rolling object rebounds from a collision with a wall. 
However, due to the coarse resolution of the position at- 
tributes, the probability of the collision was always low. As 
with the ‘stop’ event for scenario type 1, a ‘collision’ event 
was often not actively predicted. QSR types with higher 
resolution position attributes perform better in this regard. 

Scenario type 3 

Rolling ball collides with another ball: 

The system was able to predict passively and actively that 
two balls will change their velocity as a result of a collision. 
However, the QSR type had insufficient resolution to predict 
actively when a collision would occur. 

Scenario type 4 

Rolling ball collides with many balls and walls: 

The system failed to predict correctly for the last scenario 
type due to the movement of the reference objects. In the 
current implementation, other balls can be reference objects. 
When references move, the system is unable to establish a 
frame of reference and prediction deteriorates. 

4.5 Testing the Transitivity Criterion 

The training and testing were repeated with a configuration 
that ensured the transitivity criterion was not always satisfied. 
This was achieved by reducing the sampling speed by a factor 
of five. This slower sampling would allow a moving object to 
transition from one state to a non-neighbouring state between 
frames. The results, shown in table 5, show the expected de- 
terioration of the prediction ability with the violation of the 
transitivity criterion. 
Table 5. Results with unsatisfied transitivity criterion 

Scenario Type              Normal scenarios 
                           accepted as normal 
All Objects Stationary     4/4  100% 
Ball rolling to a stop     4/4  100% 
Ball bouncing off wall     4/7  57% 
Ball collides with ball    1/4  25% 

The system is able to reliably detect anomalous behaviour 
and actively predict future behaviour under the following 
conditions: 

• Enough scenarios have been run to generate a compre- 
hensive representation of physical behaviour within the 
Bayesian network. 

• The transitivity criterion is satisfied for position at- 
tributes. 

• The resolution of the QSR is high enough. 

5 Conclusion 

Our research questions were: 

1. ‘How viable is the use of qualitative spatial representa- 
tions for reasoning in a dynamic, spatial domain?’ 

If the system meets the transitivity criterion, the system 
does not suffer a combinatorial growth in the number of 
observed transitions. We conclude the approach is suit- 
able for reasoning in this dynamic, spatial domain based 
on its ability to identify abnormal behaviour and predict 
future behaviour. 

2. ‘What is the critical sampling frequency for successful 
learning and prediction for this approach?’ 

The sampling frequency is determined by the transitiv- 
ity criterion. The transitivity criterion is a critical fac- 
tor for qualitative, dynamic, spatial representations and 
the accuracy of probabilistic reasoning is degraded if the 
transitivity criterion is not satisfied. 

3. ‘Which physical concepts can be learned using this ap- 
proach?’ 

The approach was able to learn 3 of the foundational 
rules of naive physics, ‘no action at a distance’, ‘solid- 
ity’ and ‘movement on continuous paths’. This knowl- 
edge was used to accurately identify abnormal behaviour 
and, in many cases, to actively predict future behaviour. 

The prediction work was done with one QSR type. We intend 
to look at other QSR types to establish which QSR character- 
istics are important in learning physical behaviour. 
Abstract. The subset of Elements used to form an independent sub beam of a 
Phased Array Radar Antenna can be found using a two stage Genetic 
Algorithm. The use of Pareto optimisation allows the determination of the 
minimum set of Elements to be used for the desired beam pattern. The outer 
GA optimises the selection of elements to be used in the sub beam, while the 
inner GA optimises the tuning parameters of the selected set of elements. 



1 Introduction 

This paper presents a method for the selection of a subset of elements to construct 
a reference beam for a phased array radar. A reference beam can be defined as a 
secondary beam that has a lower consistent power level over a wider field. 

The continually decreasing cost of phased array systems has led to increased 
investigation into the possible uses of this technology. One of the main advantages of 
a phased array system is the ability to shape the radiation pattern depending on the 
characteristics desired for the beam at any time. The application of beam shaping for 
radar systems has presented the radar engineer with many new possibilities and new 
requirements to improve systems. One of these requirements is the application of a 
secondary beam for use as a reference when tracking known targets without diverting 
the main search beam. This secondary beam, or Reference Beam, has different 
characteristics to those usually sought for a main beam: it is wider and does not focus 
its power into a small pencil beam. 



2 Aim 

This paper aims to describe a method used to generate a reference beam that is 
independent of the main beam. The approach is based on the assumption that a small 
proportion of the nodes in an array tile can be diverted away from the main beam and 
used to create a secondary beam with little impact on the main beam. The actual 
impact on the main beam of the secondary beam is not within the scope of this paper. 

The research uses an array configuration developed by CEA Technologies as its 
array topography. For an array of 512 transmitting elements arranged in a triangular 
grid, the optimum nodes to form a reference beam of a specified width and power are 
sought. This paper describes a method used to find this optimum set of nodes for any 
specified beam or tile pattern. 

2.1 Phased Array Radars 

A phased array radar is a collection of stationary antenna elements, which are fed 
coherently, and use variable phase or time-delay control at each element to scan a 
beam to given angles in space. The multiplicity of elements allows a more precise 
control of the radiation pattern [1] [2]. The power of the radiation at any point in space 
is the sum of the individual power from each element. This sum is a vector addition to 
take into account the effects of constructive and destructive interference induced by 
the different phases of elements. An example beam pattern is displayed in Figure 1. 
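
The vector addition can be sketched numerically. The example below is a simplified illustration, not the tile model used in this work: it assumes a one-dimensional line of isotropic elements, sums the complex field contribution (phasor) of each element and squares the magnitude of the result. The function name `array_power` and all parameter values are hypothetical.

```python
import cmath
import math


def array_power(positions, phases, amplitudes, azimuth_deg, wavelength=1.0):
    """Relative power radiated towards azimuth_deg by a linear array.

    positions  -- element positions along the array, in wavelengths
    phases     -- phase setting of each element, in radians
    amplitudes -- amplitude setting of each element
    """
    k = 2 * math.pi / wavelength                  # wavenumber
    direction = math.sin(math.radians(azimuth_deg))
    # Vector (phasor) sum: each term carries both amplitude and phase,
    # so contributions can add constructively or cancel.
    field = sum(
        a * cmath.exp(1j * (k * x * direction + p))
        for x, p, a in zip(positions, phases, amplitudes)
    )
    return abs(field) ** 2


positions = [0.0, 0.5, 1.0, 1.5]                  # half-wavelength spacing
in_phase = array_power(positions, [0.0] * 4, [1.0] * 4, 0.0)
opposed = array_power(positions, [0.0, math.pi, 0.0, math.pi], [1.0] * 4, 0.0)
```

With all four elements in phase the broadside contributions add constructively; with alternating phases of 0 and pi they cancel, which is the destructive interference referred to above.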




Fig. 1. Possible beam pattern at 0 elevation between -50 and 50 degrees of azimuth. 



2.2 Genetic Algorithms 

A genetic algorithm is a stochastic global search method using principles of natural 
selection. It has three main features that distinguish it from other optimization 
methods: groups of answers are generated in parallel; the GA usually acts upon the 
encoding of parameters and not the parameters themselves; and it uses simple stochastic 
operators (Selection, Crossover, and Mutation) to explore the solution domain for an 
optimal solution [3] [4] [5]. 

An initial population is created at random. Each member of the population is an 
answer to the problem to be optimised. Parents are selected from the population for 
breeding. The child is produced via crossover and mutation to form a new answer. 
This answer is evaluated against a fitness function and the child replaces older 
members of the population. At some time after many generations the program is 
terminated and the best member of the population is the solution found. 
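
The cycle described above can be sketched in a few lines. This is a toy illustration under stated assumptions: chromosomes are bit lists, parents are chosen by a binary tournament, the child replaces the worst member, and the fitness function simply counts the 1 bits. None of these choices is taken from the radar implementation, and `run_ga` is a hypothetical name.

```python
import random


def run_ga(fitness, n_bits=16, pop_size=20, generations=200, seed=0):
    """Minimal GA: random initial population, then repeated
    selection, crossover, mutation and replacement."""
    rng = random.Random(seed)
    population = [
        [rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)
    ]
    for _ in range(generations):
        # Selection: each parent is the fitter of two random members.
        parent_a = max(rng.sample(population, 2), key=fitness)
        parent_b = max(rng.sample(population, 2), key=fitness)
        # Crossover: head of one parent, tail of the other.
        point = rng.randrange(1, n_bits)
        child = parent_a[:point] + parent_b[point:]
        # Mutation: invert one randomly chosen bit.
        child[rng.randrange(n_bits)] ^= 1
        # Replacement: the child takes the place of the worst member.
        worst = min(range(pop_size), key=lambda i: fitness(population[i]))
        population[worst] = child
    # Halting here is simply a fixed number of generations.
    return max(population, key=fitness)


best = run_ga(sum)  # toy fitness: number of 1 bits in the chromosome
```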

The implementation of a GA can be broken into seven stages: the Encoding of 
Parameters, the determination of a Fitness Function, selection of the GA operators to 
be used: Selection, Crossover, Mutation and Replacement, and the determination of 
the Halting Criterion. 
Parameter encoding 

GAs use encoding of parameters, not the value of the parameters themselves. 
Where possible the coding should have some underlying relevance to the problem and 
as small an alphabet as practical. In most applications this means a binary 
representation which fits easily into a computer representation of the parameters [6]. 

The represented parameters are often referred to as genes. The collection of genes 
which make up a single answer are correspondingly referred to as a chromosome. 

Example: Phased array radar parameters 

Tile of 4 * 4 elements. Each element has associated phase, amplitude and position: 

Element { Phase 
          Amplitude 
          Position } 

Array { Array of Elements } 

The position of an element can be stored elsewhere as its value doesn't change and 
may be inferred from the position of the element within the array. 

Phase and amplitude are often controlled by digital phase shifters and modulators 
and can be given values that reflect the number of states possible instead of a floating 
point value. This will reduce the search space for each parameter from the range of 
the floating representation to the actual number of states available. That is from 
approximately +/- 2*302 to 256. 

The resulting representation is: 

Element { Phase = 8 bits 
          Amplitude = 4 bits } 

Array { 16 * 12 bits } 

The position of each element in the array is inferred from its array index. 
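
A possible packing of this representation is sketched below. The bit layout (phase in the upper 8 bits, amplitude in the lower 4 bits) and the helper names are assumptions made for illustration; the text above fixes only the field widths.

```python
PHASE_BITS = 8       # 256 phase states
AMPLITUDE_BITS = 4   # 16 amplitude states


def encode_element(phase_state, amplitude_state):
    """Pack one element's settings into a single 12 bit gene."""
    if not 0 <= phase_state < 2 ** PHASE_BITS:
        raise ValueError("phase state out of range")
    if not 0 <= amplitude_state < 2 ** AMPLITUDE_BITS:
        raise ValueError("amplitude state out of range")
    return (phase_state << AMPLITUDE_BITS) | amplitude_state


def decode_element(gene):
    """Recover (phase_state, amplitude_state) from a 12 bit gene."""
    return gene >> AMPLITUDE_BITS, gene & (2 ** AMPLITUDE_BITS - 1)


# A 4 * 4 tile becomes a chromosome of 16 genes; an element's position
# is implied by its index in the list, so it is not stored in the gene.
tile = [(index * 16, index) for index in range(16)]
chromosome = [encode_element(phase, amp) for phase, amp in tile]
```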



Fitness function 

In a GA, the fitness function calculates a value or values to be used as a 
comparison between different solutions. The simplest form returns a value which 
represents how close the current solution is to the theoretical optimum. More complex 
functions return many values for comparison against multiple objectives. Using our 
simple radar example, a simple fitness function would return the amount of energy 
projected by the array at a set position. More advanced fitness functions might return 
the calculated beam pattern of the array or the maximum side lobe level. 

Population selection 

The selection of solutions for further optimisation is a stochastic process based on 
the relative fitness values of each member of the population. The fitter the solution, 
the more likely that member is going to reproduce and pass its genetic information 
onto the next generation. 



Cross-over 

The process of combining two parents to form a child is based on the 
recombination of genes in nature to form a new chromosome. This recombination 
basically consists of taking some of the genetic information from both parents and 
combining it to form a new child. 
Example: 

Parent 1:          ABC | DEFG 
Parent 2:          123 | 4567 
Crossover point:       ^ 
Child:             ABC | 4567 



Mutation 

Mutation is useful in nature for adding new genetic material into a population. 
Positive mutations help an individual to survive, and the mutation is then bred into the 
population as a whole. A bad mutation will usually result in the child being unfit for 
further reproduction. Mutation is achieved at the representation level with a binary bit 
of a gene being inverted. 
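
The two operators can be written directly from these descriptions. The sketch below reproduces the crossover example above on strings and inverts a single bit of an integer gene; the function names and the default gene width of 12 bits are chosen for illustration only.

```python
def single_point_crossover(parent1, parent2, point):
    """Child takes parent1 up to the crossover point, then parent2."""
    return parent1[:point] + parent2[point:]


def flip_bit(gene, bit, n_bits=12):
    """Mutate an integer-encoded gene by inverting one of its bits."""
    if not 0 <= bit < n_bits:
        raise ValueError("bit index out of range")
    return gene ^ (1 << bit)


child = single_point_crossover("ABCDEFG", "1234567", 3)
mutated = flip_bit(0b000000000101, 1)
```

Here `child` is the string ABC4567, matching the example, and `mutated` is the gene with its second-lowest bit inverted.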



Replacement 

Once a child has been created the method in which it is placed in the population 
needs to be considered. In generational GAs, a number of children equal to the initial 
population are generated and then the old population is replaced completely. An elitist 
strategy may replace all but the best member of the population with children, thereby 
preserving a good solution for further optimisation. A steady state approach 
temporarily allows the population to increase and then performs some filtering on the 
population to remove members until a new level (usually the original population 
level) is reached. 



Stopping criterion 

Evolution will continue until some external event gets in the way. The timing of 
this event can be as simple as ‘stop at generation 200’. More complicated stopping 
criteria may use measures of the genetic information present in the population (if 
members share 99% of the genes in a population then there will not be much pressure 
to evolve through reproduction). A common stopping condition is measuring if the 
best solution has changed in the last X generations, and stopping if it hasn't increased 
by a given amount or at all. 

2.3 Implemented Algorithm Specifics 

The implemented algorithm uses two nested GAs to accomplish its task. The inner 
GA is used to find an optimum phase and amplitude setting for a tile consisting of a 
set of nodes. This method of optimising a tile has been demonstrated previously for 
full array tiles [7] [8] [9]. It can be assumed that the optimisation of a limited number 
of nodes on a tile can be achieved in the same way [10][11]. The outer GA uses the 
inner GA as the fitness function for each of its chromosomes. The objective of the 
outer GA is to find a subset of nodes on the array that can be used to generate a sub 
beam of a desired shape. The fitness of any solution can be found by how well the 
beam pattern matches the objective pattern and the number of nodes in the array 
pattern. The fewer nodes the better. 
2.4 Pareto Definition 

A Pareto front distribution can be visualized as the boundary curve along which 
multiple different solutions lie. Each point on the curve neither dominates nor is 
dominated by any other point with regard to both objective functions. One or more 
points on or closer to the curve may dominate points not on the curve. Pareto curves 
allow the evolution of multiple alternative solutions to be presented to a human for 
further evaluation [12][13]. A Pareto curve has been used in this implementation, as 
there is no mathematically clear trade off between the number of nodes used and the 
shape of the beam. The selection of the best configuration can then be left to an 
expert, who is presented with the different results found by the GA. 




Fig. 2. Pareto Curve showing Dominance 

The outer curve in Figure 2 represents a theoretical Pareto Front. Four points are 
depicted to illustrate the concept of dominance. Point A is better than point C in both 
size and fitness, therefore A dominates C. A is better than B and D in only one 
criterion, and therefore A does not dominate B or D. B dominates C and D, as it is 
smaller and fitter than they both are. C and D do not dominate any point. D may be 
considered better than C as it is only dominated by B whereas C is dominated by both 
A and B. 
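
The dominance relations described for Figure 2 can be checked with a short sketch. The coordinates below are hypothetical values chosen only so that the four points relate to each other as in the text; they are not read from the figure. Each point is a (size, fitness) pair, where a smaller size and a larger fitness are better, and the usual definition of dominance is assumed (no worse in both objectives, strictly better in at least one).

```python
def dominates(p, q):
    """True if p = (size, fitness) dominates q."""
    no_worse = p[0] <= q[0] and p[1] >= q[1]
    strictly_better = p[0] < q[0] or p[1] > q[1]
    return no_worse and strictly_better


# Hypothetical coordinates, consistent with the relations in the text.
points = {"A": (30, 0.6), "B": (50, 0.9), "C": (55, 0.5), "D": (60, 0.7)}

dominated_by = {
    name: sorted(
        other
        for other in points
        if other != name and dominates(points[other], points[name])
    )
    for name in points
}
```

With these values `dominated_by` lists no dominators for A or B, both A and B for C, and only B for D.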

2.5 The Outer GA 

The outer GA is encoded as a variable sized list containing the elements that are 
present in the sub array. The size has been limited to 30 to 80 elements, as other sub array 
sizes are impractical for this specific problem. 

Parents are selected from the population using the tournament selection criterion. 
Under this selection methodology, 4 possible parents are selected from the population: 
A, B, C and D. A competes against B, and C against D, for selection as a parent. 
The winner is determined to be the individual that is dominated by the least number of 
other members in the population. The winners of AB and CD then compete. If 
competitors are equal in their domination a winner is selected randomly. 
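
This tournament can be sketched as follows, assuming each individual is summarised by a (size, fitness) pair and that dominance is defined in the usual way. The helper names and the example population are hypothetical.

```python
import random


def dominates(p, q):
    """True if p = (size, fitness) dominates q (smaller size, larger fitness)."""
    return (p[0] <= q[0] and p[1] >= q[1]) and (p[0] < q[0] or p[1] > q[1])


def domination_count(candidate, population):
    """Number of population members that dominate the candidate."""
    return sum(dominates(other, candidate) for other in population)


def duel(a, b, population, rng):
    """The less dominated competitor wins; ties are broken at random."""
    count_a = domination_count(a, population)
    count_b = domination_count(b, population)
    if count_a == count_b:
        return rng.choice([a, b])
    return a if count_a < count_b else b


def tournament_select(population, rng):
    """Pick 4 candidates; winners of A-B and C-D meet in a final duel."""
    a, b, c, d = rng.sample(population, 4)
    return duel(
        duel(a, b, population, rng), duel(c, d, population, rng), population, rng
    )


rng = random.Random(1)
population = [(30, 0.6), (50, 0.9), (55, 0.5), (60, 0.7)]
parent = tournament_select(population, rng)
```

In this small population the two non-dominated individuals always beat the dominated ones, so the selected parent is one of the first two entries.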

The fitness function returns two attributes. The first is the size of the array pattern, 
the smaller the array pattern the better. The second is how closely the beam pattern 
formed by the array pattern matches an ideal pattern. This value is returned by the 
inner GA. Both attributes are used to position the result on the Pareto curve. 



Fig. 3. Crossover Example (diagram labels: Crossover Point, Parent 1, Parent 2, Child) 



Crossover has been implemented as a simple single point crossover. The crossover 
point is selected at random based on the length of the first parent. The child is made up of 
the first parent up to this point, followed by the second parent from this point to its 
end. If the second parent is shorter than the crossover point, 
the child is simply a truncated version of the first parent. See figure 3 for an example. 

Mutation consists of the addition or deletion of a single node into the array. An 
element is selected from the total number of elements, and is added if it is not a 
member of the sub array, or deleted if it is a member. 
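
A minimal sketch of these two operators on variable sized lists of element indices is given below. Two points are assumptions rather than part of the description above: duplicate elements that arise when the parents overlap are removed from the child, and the limits on sub array size are not enforced. All function names are hypothetical.

```python
import random


def crossover_at(parent1, parent2, point):
    """First parent up to the point, then the second parent from the point on.

    If parent2 is shorter than the point, its slice is empty and the child
    is a truncated copy of parent1. Duplicates are dropped (an assumption).
    """
    child = parent1[:point] + parent2[point:]
    return list(dict.fromkeys(child))  # keeps first occurrences, in order


def crossover(parent1, parent2, rng):
    """Choose the crossover point at random from the first parent's length."""
    return crossover_at(parent1, parent2, rng.randrange(len(parent1) + 1))


def toggle_element(sub_array, element):
    """Delete the element if it is present, otherwise add it."""
    if element in sub_array:
        return [e for e in sub_array if e != element]
    return sub_array + [element]


def mutate(sub_array, total_elements, rng):
    """Select one element of the whole array at random and toggle it."""
    return toggle_element(sub_array, rng.randrange(total_elements))


truncated = crossover_at([1, 2, 3, 4, 5], [9, 8, 7], 4)
mixed = crossover_at([1, 2, 3, 4, 5], [9, 8, 7, 6, 5, 4], 2)
```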

The Outer GA utilizes a variable sized population to encourage coverage of the 
Pareto curve. An individual is removed from the population when it is dominated by 
X number of other individuals. X decreases slowly as the total population size 
increases. This mechanism allows a population to grow as it spreads out along the 
Pareto front, and to contract if a solution is found that dominates all others. 

The stopping criterion is based simply on the number of generations processed. 
This could be modified to be based on a level of convergence, or allowed to keep 
evolving until an optimum child has been produced. 



2.6 Inner GA 

The inner GA is an adaptation of methods previously used for planar array 
optimisation for desired beam shapes. This work thus differs from other 
implementations in that rather than optimising an entire array, only a selected subset 
of the array is to be optimised. The objective function is also different although it is 
based on the same type of pattern specification. 

There are two parameters to be encoded for each element, Phase and Amplitude. 
Amplitude is a 4 bit value and Phase is an 8 bit value. Each sub array to be optimised 
is therefore represented as a list of 12 bit binary numbers. 
Fig. 4. Beam Pattern Objective (azimuth axis from -60 to 60 degrees) 

The fitness function is a penalty function where the calculated radiation pattern for 
a tile is compared to a desired pattern. Figure 4 shows the desired radiation levels for 
a 0 elevation cross section of the array pattern between -60 and 60 degrees of 
azimuth. The central region is a relatively wide, low power beam for the radar tile in 
question. The desired pattern will be as close to this level as possible; any value not 
on this line will receive a penalty. The two outer regions specify an area we would 
like all side lobes to fall into. Any point above this area will incur a penalty. 





Fig. 5. Fitness function points in azimuth and elevation; Individual Element Gain Function 

As it is not practical to test every point for its power level, a finite set of points 
needed to be selected. Figure 5 illustrates the points specified as an azimuth and 
elevation pair. The inner cluster is the reference beam, and as previously shown, a 
level of 17 dB is sought at each of these points. The outer donut area is the sidelobe 
area, where all sidelobes are desired to be less than 7 dB. The power levels used take 
into account the gains that can be achieved from the elements used in the array. The 
gain of a single element is a function of the angle from broadside (that is straight out 
from the array phase). This function is shown graphically in Figure 5. 
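
One way such a penalty function could be written is sketched below. The exact form of the penalty is not given above, so the absolute deviation used inside the beam and the linear excess used in the sidelobe area are assumptions, as are the function name and the dictionary layout of the sampled pattern. Only the 17 dB and 7 dB levels come from the text.

```python
def penalty_fitness(pattern_db, beam_points, sidelobe_points,
                    beam_level_db=17.0, sidelobe_limit_db=7.0):
    """Return a fitness <= 0, where 0 means the objective is met exactly.

    pattern_db maps an (azimuth, elevation) pair to the calculated power
    in dB at that point.
    """
    penalty = 0.0
    for point in beam_points:
        # Inside the reference beam any deviation from the level is penalised.
        penalty += abs(pattern_db[point] - beam_level_db)
    for point in sidelobe_points:
        # In the sidelobe area only power above the limit is penalised.
        penalty += max(0.0, pattern_db[point] - sidelobe_limit_db)
    return -penalty


# Hypothetical sampled pattern: two beam points and two sidelobe points.
pattern = {(0, 0): 17.0, (5, 0): 15.0, (40, 0): 9.0, (-40, 0): 3.0}
score = penalty_fitness(pattern, [(0, 0), (5, 0)], [(40, 0), (-40, 0)])
```

In this example the beam point at 15 dB and the sidelobe point at 9 dB each contribute a penalty of 2, giving a fitness of -4.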

Proportional selection is used to select parents from the population. Proportional 
selection involves ranking all the members of the population based on fitness 
function, and then selecting a member with probability of its rank over the sum of all 
ranks. That is, the probability of selecting the nth-ranked of N population members is: 

$$P(n) = \frac{n}{\sum_{i=1}^{N} i} = \frac{2n}{N(N+1)}$$

A single point crossover is used. Mutation is carried out by selecting a gene at 
random from the chromosome, and inverting one of the 12 bits at random. 

A steady state replacement algorithm is used to add children into the population. 
The population grows by one with the addition of a child and then the worst member 
of the population is removed to bring the population size back to where it started. 
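
The selection and replacement steps can be sketched as below. The sketch assumes that the worst member receives rank 1 and the best member rank N, which the description above does not state explicitly; the function names are hypothetical.

```python
import random


def rank_probabilities(n_members):
    """Selection probability for ranks 1..N: rank over the sum of all ranks."""
    total = n_members * (n_members + 1) // 2
    return [rank / total for rank in range(1, n_members + 1)]


def select_by_rank(population, fitness, rng):
    """Pick one parent with probability proportional to its rank."""
    ranked = sorted(population, key=fitness)  # worst first, so it gets rank 1
    weights = list(range(1, len(ranked) + 1))
    return rng.choices(ranked, weights=weights, k=1)[0]


def steady_state_insert(population, child, fitness):
    """Add the child, then remove the worst member to restore the size."""
    grown = population + [child]
    grown.remove(min(grown, key=fitness))
    return grown


probabilities = rank_probabilities(4)
survivors = steady_state_insert([5, 1, 3], 4, fitness=lambda member: member)
parent = select_by_rank([5, 1, 3], lambda member: member, random.Random(0))
```

For a population of 4 the rank probabilities are 1/10, 2/10, 3/10 and 4/10. Note that a child which is itself the worst member is removed again immediately.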




Fig. 6. Possible Stopping conditions 

A combination of stopping criteria have been used: 

1. The GA is guaranteed to run for a minimum number of generations. 

2. The GA is not allowed to run more than a maximum number of generations. 

3. The GA must improve per generation at a rate as good as the average rate 
from the previous best result obtained. This information is passed to the GA. 

4. The GA may be terminated if there is no improvement in X number of 
generations even if it is performing above the slope in part 3. 

5. If the GA passes the best position previously found the GA will terminate if 
there is no improvement in Y generations where Y is smaller than X. 

Slope was used to halt solutions that did not appear likely to beat the best result so 
far. These sub optimal results were often taking twice as long to converge as better 
results. The slope is most effective when comparing element layouts for a sub array. 
Different sub arrays have different potential fitness and the slope usually filters out 
sub optimal layouts quickly. The two different generational convergence times (steps 
4 and 5) were implemented so results that were looking better than previous results 
would not stop prematurely due to minor stagnations. 
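
The combined criteria can be sketched as a single test applied once per generation. This is one possible reading of the five rules, not the implemented code: the slope line is assumed to start at the run's initial fitness and rise at the average rate of the previous best run, and all default values other than the stagnation limit of 100 generations (mentioned in the results) are placeholders.

```python
def should_stop(generation, fitness_now, start_fitness, stalled_for,
                best_fitness=None, best_slope=None,
                min_gens=500, max_gens=10000, stall_x=100, stall_y=50):
    """Decide whether a run of the inner GA should halt.

    stalled_for  -- generations since the last improvement
    best_fitness -- best result of any previous run, if one exists
    best_slope   -- average fitness gain per generation of that run
    """
    if generation < min_gens:
        return False                     # rule 1: minimum run length
    if generation >= max_gens:
        return True                      # rule 2: maximum run length
    if best_fitness is not None and fitness_now > best_fitness:
        return stalled_for >= stall_y    # rule 5: already past the best
    if best_slope is not None:
        required = start_fitness + best_slope * generation
        if fitness_now < required:
            return True                  # rule 3: below the slope line
    return stalled_for >= stall_x        # rule 4: stagnation


too_slow = should_stop(1000, -950.0, -1000.0, 0,
                       best_fitness=-380.0, best_slope=0.1)
on_track = should_stop(1000, -850.0, -1000.0, 10,
                       best_fitness=-380.0, best_slope=0.1)
```

In the first call the run has gained 50 units in 1000 generations where the slope line requires 100, so it is halted; in the second it is above the line and has not stagnated, so it continues.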




3 Results 

3.1 Inner GA 
Table 1. Varying Populations, with and without slope stopping criterion 

Without Slope 

Pop             20      40      60      80     100 
Ave. Gens     6360    4705    4090    3700    3720 
Ave. Fit      -380    -413    -429    -439    -443 
Best          -353    -371    -395    -399    -403 
Worst Gens   10800    7100    7100    5300    5200 

With Slope 

Pop             20      40      60      80     100 
Ave. Gens     5895    4250    2865    3250    3185 
Ave. Fit      -383    -416    -445    -445    -452 
Best          -353    -371    -405    -399    -403 
Worst Gens    7700    5800    3600    4700    4800 

Differences 

Population    Reduction in Generations    Decrease in Best Fitness 
20                     7.31%                       0% 
40                     9.67%                       0% 
60                    29.95%                       2.62% 
80                    12.16%                       0.09% 
100                   14.38%                       0% 



Initial results obtained from the inner GA showed that a small population produced 
similar results to a larger population but with much shorter convergence times. All 
population sizes obtained an increase in fitness when the convergence criteria were 
relaxed. Convergence time of the inner GA was improved by the introduction of the 
slope halting criterion. This addition resulted in an occasional small decrease in the 
best fitness reached and provided a 10 - 30% reduction in processing time. The actual 
reduction in processing was much higher when different sub array patterns were 
compared. Table 1 shows the results obtained from 20 runs of the inner GA with 
differing populations and with and without our slope stopping criterion. 
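
As a worked check, the "Reduction in Generations" figures of Table 1 follow from the average generation counts with and without the slope criterion. The short sketch below recomputes them; the variable and function names are chosen here for illustration.

```python
# Average generations from Table 1, keyed by population size.
without_slope = {20: 6360, 40: 4705, 60: 4090, 80: 3700, 100: 3720}
with_slope = {20: 5895, 40: 4250, 60: 2865, 80: 3250, 100: 3185}


def reduction_percent(before, after):
    """Relative reduction in percent, measured against the run without slope."""
    return 100.0 * (before - after) / before


reductions = {
    pop: round(reduction_percent(without_slope[pop], with_slope[pop]), 2)
    for pop in without_slope
}
print(reductions)  # {20: 7.31, 40: 9.67, 60: 29.95, 80: 12.16, 100: 14.38}
```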

The unusual results for population 60 can be explained by an exceptionally fast 
convergence to a good result in one of the early runs (run 4). All subsequent runs 
could not match the rate of improvement per generation and stopped where they met 
the slope line. There are many ways that this problem could have been avoided. The 
simplest would be to relax the halting condition of no improvement in 100 
generations. This would have added a longer, flatter period to the end of 
run 4, decreasing its average slope. The graph also illustrates that run 4 was the only 
run that could have affected the best result obtained. Of course, had the best result 
occurred before run 4 there would have been no problem. The results for a population 
of 20, as used in the two stage GA, are also presented in Figure 7. 





Fig. 7. Possible Stopping conditions; Population 20 Test Results 



3.2 Outer GA 

The outer GA produced a series of Pareto fronts that provide clear indications of an 
improving result. A sample curve mapped over 1000 generations of the outer GA is 
presented in Figure 8. It shows clearly the large outward movement of the curve 
during early generations. Improvements after this can be seen mainly around 40 to 50 
nodes, due to biases in the retention policy that favour smaller array patterns and the natural 
drift to a single point on the Pareto front caused by the lack of significant niche criteria. 
Figure 8 also shows a combined best and average Pareto Front for 16 runs of the 
combined GA. The array pattern for the highest rank individual is shown in Figure 9. 
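
For readers unfamiliar with Pareto fronts, the following minimal sketch extracts the non-dominated set from a list of candidate solutions. It assumes, purely for illustration, that each candidate is a pair (number of nodes, fitness) and that both objectives are minimised; the sample points are invented and are not results from this work.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]


# Hypothetical (nodes, fitness) pairs, both to be minimised.
candidates = [(40, -380), (50, -400), (45, -370), (50, -390)]
print(pareto_front(candidates))  # [(40, -380), (50, -400)]
```

A niche mechanism, as discussed later in the paper, would act on top of such a front to keep its members spread out.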





Fig. 8. Sample Pareto Front at 100 generation intervals; Best and average Pareto Front found 

The sidelobes produced by the beam are clearly above those that were desired. 
(There should be no peaks other than the central one.) It is also easy to see that these 
sidelobes are between the points specified in the fitness function of the inner loop. 
This can be more clearly seen in the two dimensional representation taken across the 
plane of 0 elevation. The sidelobe positions specified to be below 7 db are 50, 40, 20 
20, and 10 degrees on either side of 0. All these points are below 7 db and therefore 
fit the function given to be optimised. 

To produce a more useful result for a radar engineer, the evaluation function will 
need to incorporate a finer mesh of points for evaluation within the inner GA's fitness 
function. The results produced clearly show that this method of optimisation can 
produce radar beam patterns specified by a penalty/fitness function. 




Fig. 9. Array Pattern for fittest individual found 



3.3 Discussion and Further Work 

The most limiting factor in this approach is the speed at which the inner GA 
converges to a minimum. A number of methods may be employed to increase this 
speed. The addition of a local search method into the inner GA should produce a 
simpler solution space and allow faster convergence to global optima [14][15]. 

The outer GA also has a large bias towards a smaller number of nodes in a sub 
array. The replacement policy used results in the removal of larger sub array sizes 
from the population when fitter small arrays have been found but there is no removal 
of small arrays with bad fitness, except by a small array with a better fitness. 

The recombination function will on average produce a child with size the average 
of its two parents. Hence successive populations will have replacement pressure 
biasing the small members of the population. A niche mechanism could be 
implemented to prevent or balance this replacement pressure. It should be noted that 
selection pressure will usually favour the larger sized sub arrays as it is possible to 
shape these beams to at least the same extent as the smaller beams (as an amplitude 
can be zero). Natural drift along the Pareto Front would also be minimised or 
eliminated by the introduction of a niche mechanism. 



4 Conclusion 

A two staged GA approach to beam forming and element selection is a viable 
method of producing a good range of solutions against a multi objective criterion. 

The limitation of an efficient fitness function is a problem that may be overcome 
by a hybrid algorithm in the inner GA. 



Abstract. We propose a fuzzy rule-based system to map representations of the 
emotional state of an animated agent onto muscle contraction values for the appropriate 
facial expressions. Our implementation pays special attention to the way in 
which continuous changes in the intensity of emotions can be displayed smoothly 
on the graphical face. The rule system we have defined implements the patterns 
described by psychologists and researchers dealing with facial expressions of humans, 
including rules for displaying blends of expressions. 



1 Introduction 

In this paper we introduce a fuzzy rule-based system that generates lifelike facial 
expressions on a 3D face of an agent based on a representation of its emotional state. 

Within the Parlevink research group at the University of Twente, previous work has 
dealt with natural language interactions between humans and embodied conversational 
agents in virtual environments ([12], [13]). Our aim is to build believable agents for 
several application areas: information, transaction, education, tutoring and e-commerce. 
For an embodied agent to be believable it is necessary to pay attention not only to its 
capacities for natural language interaction but also to non-verbal aspects of expression. 
Furthermore, the mind of a believable agent should not be restricted to reasoning, 
intelligence and knowledge, but should also model emotions and personality. We have therefore started 
exploring computational models of emotional behaviour [10]. The representations used 
in this work form the basis for the results reported here on the facial expressions of 
emotions. 

Based on the descriptive work by Ekman and Friesen in [2] we define rules to map 
emotion representations onto the contraction level of facial muscles. In the research 
reported on in this paper, we focus on two aspects of facial expression modeling. First, 
we want to take into account the continuous changes in expressions of an emotion 
depending on the intensity by which it is felt. Our fuzzy-rule based approach is chosen 
to assure smooth results. Secondly, we want to find a way to specify combinations of 
expressions, i.e. blends, in accordance with the literature mentioned. 

Earlier work on computational models of emotion and facial expression includes 
the directed improvisation system of Hayes-Roth and van Gent [5], which makes an 
emotion-based selection among animation and audio sequences. Perlin and Goldberg’s 
Improv animation system [16], [17] layers small animations under the control of scripts 
and state variables including mood and personality. Stern, Frank and Resner [22] develop 
an animated pets system named Petz, with facial expression, posture, and vocalizations 
corresponding to each personality profile and internal emotion state of the character. 
In most of this work, our concerns with modelling intensity as well as blends figure 
less prominently. Blends of emotions are often defined in terms of graphics algorithms 
combining single emotion expressions (using interpolation for instance, [6], [14], [18]) 
instead of relying on the empirical rules described in the literature. Hendrix et al. [6] use 
interpolation to display the intensity of emotions. For expressions of blends of emotions, 
the basic emotions are arranged on an Emotion Disc with the neutral face in the centre 
and maximal expressions of emotions on the perimeter. Each position in the Emotion 
Disc corresponds to an expression obtained by interpolation between the predefined 
expressions positioned on the disc. This method also does not rely on the empirical 
literature, as the emotion intensity may be represented differently in different facial 
regions. Besides using interpolation, Pighin et al. [18] also use regional blending to 
create blends of expressions. However, this method creates uncorrelated facial regions 
which do not appear in the human face. Moreover, they need to use a very complex 3D 
face mesh and to collect photographs of expressions of each basic emotion using cameras 
at different positions in order to generate blends of expressions. Therefore, this approach 
is not suitable for our project, which aims at realtime animation of the agent. 

The ideas from emotion theory and facial expression on which our work is based are 
summarised in section 2. In section 3 we give an overview of the complete system that 
we have implemented. We then discuss the fuzzy rule based system in section 4 in more 
detail. Some results and the evaluation of the system are presented in section 5. 

2 Emotions and Facial Expressions 

The rule-based system presented below is based on a collection of theories of emotion 
and facial expression proposed by [2], [7], [9] and others that has been labeled “The 
Facial Expression Program” by Russell [21]. In this program, it is assumed that emotions 
can be distinguished discretely from one another. A limited number of these are called 
basic. Opinions differ on what it means for an emotion to be called basic. Russell (o.c.) 
summarises this discussion as follows: “Each basic emotion is genetically determined 
universal and discrete. Each is a highly coherent pattern consisting of characteristic facial 
behavior, distinctive conscious experience (a feeling), physiological underpinnings, and 
other characteristic expressive and instrumental actions.” 

In this paper we consider the following six emotions: Sadness, Happiness, Anger, 
Fear, Disgust and Surprise. These are said to be universal in the sense that they are 
associated consistently with the same facial expressions across different cultures ([2]). 
In their book, Ekman and Friesen also describe in detail what the expressions for these 
emotions and certain blends look like. 

Emotion feelings may differ in intensity. In [2] it is pointed out how for each of the 
basic emotions the expression can differ depending on the intensity of the emotion. It is 
therefore important for us to build our system on a representation that takes intensities 
into account. 

The human face is also able to show a combination of emotions at the same time. 
These are called blends. Ekman and Friesen describe which blends of the basic emotions 
occur and what these blends look like universally. We have used their descriptions as the 
basis for our fuzzy rules. 

3 Overview of the System 

Our system maps a representation of the emotional state to a vector of facial muscle 
contraction intensities which is used to control the facial expressions of the 3D face. The 
system, as shown in figure 1, consists of six components: 

1. The input is an Emotion State Vector (ESV). This is a vector of basic emotion 
intensities, each represented by a real number: 

$ESV = (e_1, e_2, \ldots, e_6)$ where $0 \le e_i \le 1$ 

2. The output is a Facial Muscle Contraction Vector (FMCV): 

$FMCV = (m_1, m_2, \ldots, m_{18})$ where $0 \le m_i \le 1$ 

This is a vector of facial muscle contraction intensities. 

3. The Expression Mode Selection determines whether a single emotion or a blend of 
two emotions will be expressed in the 3D face model. 

4. In the Single Expression Mode FRBS, muscle contraction intensities are produced 
from a single emotion intensity. 

5. In the Blend Expression Mode FRBS, muscle contraction intensities are produced 
from two emotion intensity values. 

6. The muscle based 3D face model expresses the emotions. 





Fig. 1. The proposed system 



FRBS. The core is formed by the fuzzy rule-based system (FRBS). Two collections of 
fuzzy if-then rules are used to capture the relationship between the ESV and the FMCV. 
During fuzzy inference, all the rules that are in one of the collections are fired and 
combined to obtain the output conclusion for each output variable. The fuzzy conclusion 
is then defuzzified with the Center of Area (COA) [11] method, resulting in a final crisp 
output. 
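
As an illustration of the defuzzification step, a discrete form of the Center of Area method can be sketched as follows. This is a generic textbook formulation over a sampled output domain, given under the assumption that the aggregated fuzzy conclusion is available as membership degrees at sample points; it is not taken from the implementation described here.

```python
def center_of_area(xs, mus):
    """Discrete Center of Area: membership-weighted mean of the sample points.

    xs  -- sample points of the output variable (e.g. contraction levels in [0, 1])
    mus -- membership degree of the aggregated fuzzy conclusion at each point
    """
    total = sum(mus)
    if total == 0:
        raise ValueError("the fuzzy conclusion is empty; no rule fired")
    return sum(x * mu for x, mu in zip(xs, mus)) / total


# Hypothetical aggregated conclusion, skewed towards high contraction.
levels = [0.0, 0.25, 0.5, 0.75, 1.0]
degrees = [0.0, 0.0, 0.5, 1.0, 0.5]
print(center_of_area(levels, degrees))  # 0.75
```

Because the crisp output is a weighted mean, it changes continuously as the rule activations change, which is what gives the smooth transitions between expressions.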

The fuzzy if-then rules for both single and blend expressions are based on Ekman 
and Friesen’s summary of the facial movements to express basic emotions [2]. We 
used several other information sources to map the descriptions of faces from Ekman 
and Friesen onto the values for the muscle contraction intensities that generate the 
expressions. These sources were the Facial Action Coding System (FACS) [3], and the 
book and tutorial by, respectively, Waters and Parke [15], and Prevost and Pelachaud 
[20]. Also our own observations on emotion expression in human faces have played a 
role. We will discuss these rules in more detail in section 4. 

As can be seen from the system, the FRBS is actually broken up into three components: 
the Expression Mode Selection, the Single Expression Mode FRBS and the 
Blend Expression Mode FRBS. The expression of an emotion in a blend may differ in 
important ways from the expression of the emotion occurring on its own. Typically, for 
a single emotion expression several regions of the face are involved whereas in blends 
one of these regions may be used for the other emotion. We therefore do not want the 
single expression rules to fire when blends occur. It might be possible to build a system 
with just a single collection of fuzzy rules. However this will complicate the statement 
of the rules. 

The emotional state vector, ESV, represents the emotional state of the agent. The 
human face cannot display all the combinations of emotion intensities that can be felt 
at the same time universally and unambiguously. It seems that only two emotions can 
be displayed at the same time, because the face has only a limited number of regions to 
display emotions. The mapping between emotional state and facial display is also 
indirect for other reasons. In real persons, several factors may determine whether an 
emotion that is felt will be displayed. There may be cultural rules, for 
instance, that inhibit showing certain emotions. An Expression Mode Selection module 
can mediate between the emotion state as it is felt and the rules for representing the 
emotions to be displayed. In our current implementation we select either the single or 
blend expression mode based on the intensities of the emotions felt*. 
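
The selection rule given in the footnote (a threshold of 0.1 on the emotion intensities) can be sketched as below. The ordering of the emotions in the vector, the function name, and the return format are assumptions made for the sketch. One simplification is labelled in the code: ties are broken by vector order here, whereas the footnote selects randomly among tied emotions.

```python
# Assumed ordering of the six basic emotions inside the ESV.
EMOTIONS = ("Sadness", "Happiness", "Anger", "Fear", "Disgust", "Surprise")
THRESHOLD = 0.1  # intensity threshold stated in the footnote


def select_mode(esv):
    """Return ("single", (emotion,)) or ("blend", (emotion_a, emotion_b))."""
    if len(esv) != len(EMOTIONS):
        raise ValueError("the ESV must hold one intensity per basic emotion")
    active = [i for i, intensity in enumerate(esv) if intensity > THRESHOLD]
    if len(active) == 1:
        # Exactly one emotion is noticeably present: single expression mode.
        return "single", (EMOTIONS[active[0]],)
    # Otherwise blend the two strongest emotions.
    # Simplification: ties resolve by vector order instead of random choice.
    ranked = sorted(range(len(esv)), key=lambda i: esv[i], reverse=True)
    return "blend", (EMOTIONS[ranked[0]], EMOTIONS[ranked[1]])


print(select_mode([0.8, 0.0, 0.0, 0.05, 0.0, 0.0]))  # ('single', ('Sadness',))
print(select_mode([0.6, 0.0, 0.0, 0.4, 0.0, 0.0]))   # ('blend', ('Sadness', 'Fear'))
```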

FMCV. The muscle contraction intensities which the rules give rise to are used to 
manipulate the 3D face. Currently, we use 17 muscles and an additional parameter, Jaw 
Angle. The latter determines how far the mouth will be opened. 

We have used the 3D face from Waters [15] for this project. The reason for using this 
face is that it is detailed enough to generate almost every visually distinguishable facial 
movement. It also provides an easy way to define a suitable muscle system. 

The muscle system was defined for the face on the basis of anatomical information 
([15] and [20]). We first created the muscles in the 3D face model at the positions 
similar to those of real muscles. Next we adjusted our muscle system until it produced 
reasonably lifelike effects on the 3D face model. For the adjustments we also relied on 
the photographs in [2]. 

* The Single Expression Mode is selected when a single emotion has to be expressed. This 
is the case when only one emotion has an intensity bigger than 0.1 while the other emotions have 
intensities close to zero (smaller than 0.1). In this case, the Single Expression Mode Fuzzy 
Rule Based System (FRBS) will be used, and the input of the FRBS is the single emotion with 
the highest intensity. When the Blend Expression Mode FRBS is used, the input of the FRBS is 
the pair of emotions with the highest intensity (in the case that more than two emotions have the 
same highest intensity, two emotions will be randomly selected to express). We certainly do 
not claim psychological realism here. 

Some muscle types, such as the sphincter muscles, have not been implemented in 
Waters’ 3D model. We therefore replaced the sphincter muscles Orbicularis Oculi and Orbicularis 
Oris by a collection of other muscle types combined into a circle to produce similar 
effects as the real sphincter muscles. The muscles that are implemented in our system 
can be seen in figure 2 and table 1. 




Table 1. Implemented muscles in the system 

No.  Muscle name              No.  Muscle name 
1    Zygomatic Major          10   Levator Labii Nasi 
2    Zygomatic Minor          11   Levator Labii Superioris 
3    Triangularis             12   Depressor Supercilli 
4    Risorius                 13   Corrugator Supercilli 
5    Depressor Labii          14   Depressor Glabelle 
6    Mentalis                 15   Levator Palebrae Superios 
7    Orbicularis Oris         16   Orbicularis Oculi Palebralis 
8    Frontalis Medialis       17   Orbicularis Oculi Orbitalis 
9    Frontalis Lateralis 


The system is designed to take into account future expansions. First, the introduction 
of the FMCV enables the combination of an agent’s lip movements during speaking 
with facial emotion expression. Secondly, the use of the ESV and the Expression Mode 
Selection allows the integration of the agent’s intention and personality into the model 
without changing the fuzzy rules for expressing emotions. This can be done by distinguishing 
the real ESV as felt from something like a “to-display” ESV. The “to-display” 
ESV does not represent the agent’s real emotion state but the emotion state the 
agent wants to express. For example, with a strong personality, the agent may display a 
fake smile to mask sadness by increasing the intensity of happiness in the “to-display” 
ESV. 

4 The Fuzzy Rule Based System 

The subsystems “single expression mode” and “blend expression mode” are both implemented 
using fuzzy logic. Both subsystems must convert an emotion state to a contraction 
level for the facial muscles, taking into account the intensity of the emotions. In the literature 
on facial expressions of emotions, qualitative descriptions like “surprise then lift 
eyebrows” can be found. In order to take intensities into account as well, these (logical) 
rules were transformed into fuzzy rules. The fuzzy rule approach allows us to incorporate 
qualitative descriptions as above with quantitative information (emotion intensity 
and contraction level). Moreover, we still have a comprehensible rule-based system in 
which the logical descriptions are visibly encoded. We would miss out on that when 
using other models like neural networks, for instance. 

First we model the emotion intensity by five fuzzy sets (figure 3): VeryLow, Low, 
Medium, High, and VeryHigh. The contraction level of each muscle is described by 
again five fuzzy sets (cf. figure 4): VerySmall, Small, Medium, Big, and VeryBig. The 
exact form of the membership functions and the support of each membership function 
are experimentally determined by hand. 
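
To make the fuzzification step concrete, the sketch below models the five intensity sets with evenly spaced triangular membership functions. The triangular shape and the breakpoints are assumptions chosen for illustration only; as stated above, the actual membership functions were tuned by hand and are shown in figure 3.

```python
def triangular(x, left, peak, right):
    """Triangular membership function with support (left, right) and maximum at peak."""
    if x == peak:
        return 1.0
    if x <= left or x >= right:
        return 0.0
    if x < peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical, evenly spaced fuzzy sets over the intensity range [0, 1].
INTENSITY_SETS = {
    "VeryLow": (-0.25, 0.0, 0.25),
    "Low": (0.0, 0.25, 0.5),
    "Medium": (0.25, 0.5, 0.75),
    "High": (0.5, 0.75, 1.0),
    "VeryHigh": (0.75, 1.0, 1.25),
}


def fuzzify(intensity):
    """Degree of membership of a crisp intensity in each of the five sets."""
    return {name: triangular(intensity, *abc) for name, abc in INTENSITY_SETS.items()}


# An intensity halfway between Low and Medium belongs to both with degree 0.5.
print(fuzzify(0.375))
```

With overlapping sets like these, an intensity between two peaks partially fires the rules of both neighbouring sets, which is the source of the smooth interpolation between rule consequents.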



[Figure 3: membership degree μ_intensity(emotion) plotted against Degree of Intensity] 

Fig. 3. Membership functions for emotion intensity 



As we explained in the previous section, the Expression Mode Selector decides 
whether a single emotion has to be displayed or a blend. The rules in the single-expression 
mode FRBS take on the following form. 

If Sadness is VeryLow then muscle 8’s contraction level is VerySmall, muscle 12’s 
contraction level is VerySmall ... 




Generation of Facial Expressions from Emotion Using a Fuzzy Rule Based System 

[Figure 4: membership degree μ_level(muscle_contraction) plotted against Degree of Level] 

Fig. 4. Membership functions for muscle contraction level 



The sample rule above encodes the information presented in the first row of table 2. 
Note that the relation between the emotion intensities and the muscle contraction level is 
not so straightforward that we can use a simple mapping system. The name and position 
of the muscles can be seen in table 1 and figure 2. All the rules for other single emotions 
are represented in the table form and can be found in [1]. 



Table 2. Fuzzy rules for emotion Sadness. Emotion intensity: vl: VeryLow, l: Low, m: Medium, 
h: High, vh: VeryHigh. Contraction level: vs: VerySmall, s: Small, m: Medium, b: Big, 
vb: VeryBig, -: no contraction 

E. Intensity   m8   m12  m13  m14  m17  m16  m3 
vl             vs   vs   vs   vs   vs   -    - 
l              s    s    s    s    s    vs   - 
m              m    m    m    m    m    s    - 
h              b    b    b    b    b    m    m 
vh             vb   vb   vb   vb   vb   m    b 


If the Single Expression Mode is not used, then the Blend Expression Mode is 
selected. In this mode two emotions are displayed on the face. Normally each of the 
two emotions is displayed in a separate region of the face. The fuzzy rules for the blend 
of expressions reflect this fact. The contraction level of a muscle is determined by the 
intensity of the emotion that will be displayed in the facial region to which this muscle 
belongs. As the contraction level of each muscle is determined by the intensity of only 
one of the emotions, no conflicting values will be assigned to any muscle’s contraction level. 
We will illustrate this with a description of the blend of sadness and fear. 

Ekman and Friesen [2] describe how in such a blend sadness is expressed in the 
brows and eyelids while fear is expressed in the mouth. Combining this with the specification 
of muscle movements in the FACS, we can define the emotions in muscle terms. 
Sadness is expressed by contracting the muscles Frontalis Medialis (8), Depressor Supercilli (12), 
Corrugator Supercilli (13), Depressor Glabelle (14) and Orbicularis Oculi (16 
and 17). Fear is expressed by contracting the muscles Triangularis (3), Risorius (4), Depressor 
Labii (5) and by opening the jaw. The level of contraction of each of those muscles 
is then determined by the intensities of sadness and fear. The format of such a rule in 
our system looks as follows. 

If Sadness is Low and Fear is Medium then muscle 8’s contraction level is Small, 
muscle 3’s contraction level is Medium ... 

Some examples of the rules are presented in table 3. The full set of rules can be found 
in [1]. 
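
The region-based structure of the blend rules can be sketched in code. The muscle sets below are the ones listed in the text, and the three sample rules are the rows of table 3; the data layout and helper names are assumptions made for illustration, and the jaw parameter is left out.

```python
# Muscles driven by each emotion in the sadness/fear blend (numbers as in table 1).
SADNESS_REGION = {8, 12, 13, 14, 16, 17}  # brows and eyelids
FEAR_REGION = {3, 4, 5}                   # mouth

# Each muscle belongs to exactly one region, so no rule can assign it two values.
assert SADNESS_REGION.isdisjoint(FEAR_REGION)

BLEND_MUSCLES = (8, 12, 13, 14, 17, 16, 3, 4, 5)

# Sample rules from table 3, keyed by (sadness, fear); None means no contraction.
BLEND_RULES = {
    ("vl", "vl"): ("vs", "vs", "vs", "vs", "vs", None, "vs", "vs", "vs"),
    ("vl", "l"): ("vs", "vs", "vs", "vs", "vs", None, "s", "s", "s"),
    ("m", "l"): ("m", "m", "m", "m", "s", "vs", "s", "s", "s"),
}


def blend_consequent(sadness, fear):
    """Fuzzy contraction level per muscle for one sadness/fear rule."""
    row = BLEND_RULES[(sadness, fear)]
    return {m: level for m, level in zip(BLEND_MUSCLES, row) if level is not None}


def driving_emotion(muscle):
    """Name the emotion whose intensity determines the given muscle."""
    return "Sadness" if muscle in SADNESS_REGION else "Fear"


print(blend_consequent("m", "l"))
print(driving_emotion(3))  # Fear
```

In the sample rules, changing only the fear label (rows one and two) changes only the mouth muscles 3, 4 and 5, which is the behaviour the region assignment is meant to guarantee.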

5 Result and Evaluation 

The expressions of six basic emotions and a neutral face are displayed in figure 5. In 
figure 6, surprise is shown with increasing intensity. The increasing intensity of surprise 
can be seen in the increase in the raising of the eyebrows and the increase in the opening 
of the mouth. Figure 7 (left) shows the blend of anger and disgust. It can be seen that anger 
is represented in the eyebrows and eyelids while disgust is represented in the mouth. 
A blend of happiness and surprise is shown in figure 7 (right). This is a combination of 
surprised eyebrows and a happy smiling mouth. 



Table 3. Fuzzy rules for blend of Sadness and Fear 

Sadness  Fear   m8   m12  m13  m14  m17  m16  m3   m4   m5 
vl       vl     vs   vs   vs   vs   vs   -    vs   vs   vs 
vl       l      vs   vs   vs   vs   vs   -    s    s    s 
m        l      m    m    m    m    s    vs   s    s    s 


The results also show that the emotions are not only displayed in the main parts of 
the face like mouth and eyebrows but also in very detailed parts like eyelids and lips. The 
blends of expression are displayed according to the rules as described by psychologists 
instead of being computed by some graphics algorithm that combines values for single 
emotion expressions (morphing, interpolation). And finally, the quality of the facial 
expressions is improved by the smooth relationship function between emotion intensities 
and muscle contraction levels. This smooth relationship function is obtained with fairly 
simple fuzzy if-then rules rather than with complicated formulas or intensively trained 
neural networks. 

Fig. 5. Basic emotions: Neutral, Sadness, Happiness, Anger, Fear, Disgust, Surprise (from left to 
right) 

Fig. 6. Increasing surprise 

Fig. 7. Blend of anger and disgust (left), happiness and surprise (right) 

For a first evaluation, questionnaires were set up to assess the recognizability of the 
expressions generated by the system. The expressions of six basic emotions and a neutral 
face generated by the system were shown to 20 people. The result of how they recognised 
these emotional expressions is summarized in table 4. As can be seen from the table, the 
generated emotion expressions are recognised as what they are intended to be by a large 
percentage of the people. We also showed these people the picture of the generated blend 
expression of Anger and Disgust (figure 7 left) to see how good our blend expression 
generation is. The possible answers that people could choose from were expressions of: 
Sadness and Disgust, Anger and Disgust, Fear and Anger, Fear and Sadness, Sadness 
only, Disgust only, and Anger only. There were 11 people who recognised it as a blend 
expression of Anger and Disgust; 3 people recognised it as the expression of Sadness 
only; 1 person recognised it as an expression of Disgust only; and 5 people recognised 
it as an expression of Anger only. So about half of the people recognised it correctly. 
For an appropriate analysis of this result we need further questionnaires comparing these 
with similar expressions generated by other systems or with the photos from Ekman and 
Friesen. 
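
The counts reported for the blend expression can be turned into percentages with a few lines of arithmetic. The variable names are chosen here for illustration; the counts are those given in the text.

```python
# Answers given by the 20 participants for the Anger and Disgust blend.
responses = {
    "Anger and Disgust": 11,
    "Sadness only": 3,
    "Disgust only": 1,
    "Anger only": 5,
}

total = sum(responses.values())
shares = {answer: 100 * count / total for answer, count in responses.items()}

print(total)                        # 20
print(shares["Anger and Disgust"])  # 55.0
```

The correct answer was therefore chosen by 55% of the participants, in line with the "about half" stated above.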



Table 4. Evaluation result on how people recognise generated facial expressions 

                          Intended 
Recognised as   Neutral  Sadness  Happiness  Anger  Fear  Disgust  Surprise 
Neutral         95%      -        -          -      -     -        5% 
Sadness         -        75%      -          15%    5%    5%       - 
Happiness       -        -        100%       -      -     -        - 
Anger           5%       5%       -          70%    10%   10%      - 
Fear            -        15%      -          -      80%   -        5% 
Disgust         -        5%       -          10%    5%    75%      5% 
Surprise        -        -        -          5%     -     10%      85% 



6 Conclusion and Future Research 

In this paper, we have proposed a fuzzy rule based system to generate facial expressions 
from an agent’s emotional state. With simple fuzzy rules, lifelike facial expressions are 
generated based on descriptions in the literature. The variations resulting from differences 
in the intensity of emotions are also successfully displayed. 

The effect of the fuzzy membership function on the manner of expression is one of 
the issues for future research. In the next phase of the project, we intend to add a learning 
component to the system so that we can have slightly different ways of expressing an 
emotion state for different agents. The display of emotions will have to be combined 
with other systems that influence what is shown on the face (like communicative signals 
and lip movements). The expression mode selector will have to become more complex 
to take into account other factors besides intensity. Finally, other emotions besides the 
“universal” ones will have to be dealt with. 



Abstract. In this paper, we propose an efficient method of classifying 
forms that is applicable in real life. Our method identifies a 
small number of local regions by their distinctive images with respect 
to their layout structure and then uses DP (Dynamic Programming) 
matching to match only these local regions. The disparity score in 
each local region is defined and measured to select the matching regions. 
A Genetic Algorithm is also applied to select the best matching regions 
from the viewpoint of performance. Our approach of searching and 
matching only a small number of structurally distinctive local regions 
would reduce the processing time and yield a high rate of classification. 



1 Introduction 

There are some issues related to classifying form documents, such as the one 
involving a feature vector and a classifier. The process of feature extraction 
lightens the load of a classifier by decreasing the dimension of a feature vector 
and as a result enhances the performance of the classifier. In other words, the 
recognition rate becomes high while the computation time is reduced. 

In [1] ten line corner junctions are used as features for document identification. 
Forms identification was implemented with a neural network, and 98% 
accuracy was achieved on the United States Internal Revenue Service forms. 
The time to calculate corner features was 4.1 CPU seconds on a SPARCstation II. 
Horizontal and vertical line elements are easily distinguishable as textual images 
on a form. In [2], business forms were used as test data, and the nine corner 
types formed by lines were extracted as features, which were converted into a string 
representation. A simple edit distance algorithm was used as a classifier. In 
this research, 1.75 CPU seconds, or approximately 25 seconds in real time, was 
needed to process a form. Also, model matching based on an association 
graph was proposed for understanding form images [3]. In this work, the information 
related to lines was extracted as a feature vector and represented by 
a graph. A graph matching algorithm was used as a recognizer and 
14 categories of forms were used as test data. In this research, the average rate 
of recognition was 98%, and the computation time was 5.76 CPU seconds. [4] 
proposed a method to recognize all the field items that are enclosed by lines and 
to construct a binary tree that represents the neighboring relations among the field 
items by using the upper left coordinates of the field items. Furthermore, the 
binary trees can be used to recognize input forms. 

Another method of form classification was proposed by [5]. In this paper, 
k-NN, MLP and a new structural classifier based on tree comparison are used 
as classifiers. The three classifiers showed good rates of recognition ranging from 
87.31% to 100% according to the related thresholds. However, the pixel by pixel 
operation to extract features is overall a time consuming operation. 

Many methods and approaches in general treated all areas of a form document 
equally. Some methods are proficient at recognizing the form layout structure, 
and the resulting logical layout of a form is useful for interpretation. However, 
for form documents with complex structures, a new applicable system is 
needed to overcome the lengthy time of processing. Hence, we propose a system 
of form classification that centers on the method of partial matching. By performing 
structural recognition and form classification on only some areas of the 
input form, valuable time could be saved. 



2 Overview of the Proposed Approach 



We propose a model-based system that uses structural knowledge of forms to
classify an input form. The system operates in two stages, as shown in Fig. 1:
form registration and form classification. In the first stage, several local areas
of the form image that are distinctive in their structure are found. Only these
local areas are matched using DP matching, so that they can be analyzed further
in the next stage, form classification.

The process can be summarized as follows. First, the layout structures of all
the forms to be processed are identified. Each image is partitioned into
rectangular local areas according to the locations of horizontal and vertical
lines. Next, the disparity in each partitioned local area of the forms is
defined (cf. Sect. 4.1 and Eq. 3). This definition is used for all subsequent
disparity measurements. Then, the preliminary candidates for the matching areas
are selected according to their disparity scores (cf. Eq. 5). DP matching is
used to calculate the disparity. The final candidates for the matching areas are
selected by one of several approaches: the Largest Score First (LSF) method
uses the average of the disparity values in a matching area, and the Maximal
Disparity First (MDF) method adds a matching area in order to classify the form
documents that could be misinterpreted by the existing matching areas. Finally,
the information related to the matching areas is registered as form models. In
the classification step, an input form is classified by using the registered
form models.
[Figure 1: two stages, Form Registration and Form Classification]

Fig. 1. A model-based system of form processing

3 Feature Extraction 

3.1 Partition of Forms

Forms are partitioned according to their identified structures, which are
composed of line segments. The partition should meet the following requirements.
First, the form document must be partitioned so that the features of the form
can be extracted robustly. Second, each partitioned area should contain at least
one line so that the matching process can be applied. Finally, partitioning must
also be stable for forms and matching areas that are in transition.

The partitioning process requires the location and the starting/ending points
of the lines in a form. First, the gap between the two adjacent line segments
that are farthest from each other is bisected. Next, the process is repeated
on the resulting halves. Partitioning is performed recursively until the distance
between any two adjacent lines is smaller than a certain threshold (the partition
threshold). In detail, the vertical separators are placed at the center between
neighboring vertical lines and the starting/ending points of horizontal lines.
Likewise, the horizontal separators are placed at the center between neighboring
horizontal lines and the starting/ending points of vertical lines. Fig. 2 shows
an example of the partition. Fig. 2(a) shows the form structure obtained by
overlapping the two kinds of form documents to be processed. The dotted lines
in Fig. 2(b) indicate the horizontal and vertical separators, which are defined
by a certain partition threshold. Fig. 2(c) shows the partitioned result.
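The recursive bisection described above can be sketched as follows. This is only an illustration under one reading of the procedure: the coordinates of lines and line end points along one axis are sorted, the widest gap is split at its midpoint, and both halves are processed recursively until every remaining gap is below the partition threshold. The function name `find_separators` and the sample coordinates are hypothetical.

```python
from typing import List, Sequence


def find_separators(coords: Sequence[float], threshold: float) -> List[float]:
    """Return separator positions along one axis (hypothetical helper).

    coords holds the positions of lines and line end points along the axis.
    The widest gap between adjacent positions is bisected, and both halves
    are processed recursively until every gap is smaller than threshold.
    """
    points = sorted(set(coords))
    if len(points) < 2:
        return []
    # Index k of the widest gap, between points[k] and points[k + 1].
    k = max(range(len(points) - 1), key=lambda idx: points[idx + 1] - points[idx])
    gap = points[k + 1] - points[k]
    if gap < threshold:
        return []  # all adjacent positions are already close enough
    middle = (points[k] + points[k + 1]) / 2
    left = find_separators(points[: k + 1], threshold)
    right = find_separators(points[k + 1:], threshold)
    return left + [middle] + right


# Example: x coordinates of vertical lines and horizontal line end points.
print(find_separators([0, 10, 12, 30], threshold=5))
```

Running the same function once on the x coordinates and once on the y coordinates would give the vertical and horizontal separators, respectively.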

3.2 Feature Extraction 

A feature vector consists of the location and starting/ending positions of
horizontal and vertical lines. For example, a vertical line is represented as
$(v_l, v_s, v_e)$. Here, $v_l$ stands for the location of the vertical line
($x$ coordinate), and $v_s$ and $v_e$ indicate its starting and ending positions
($y$ coordinates), respectively. Therefore, if $m$ vertical lines exist in a
partitioned area, the feature vector is constituted as follows:

$((v_{l1}, v_{s1}, v_{e1}), (v_{l2}, v_{s2}, v_{e2}), \ldots, (v_{lm}, v_{sm}, v_{em}))$

Fig. 2. The distribution of lines and a partition of forms

The extracted feature vectors are normalized as follows. First, we find the
maximal width and height of all the partitioned local areas. The x coordinates of
vertical lines and the x coordinates of the starting/ending positions of
horizontal lines are then divided by the maximal width. Likewise, the y
coordinates of horizontal lines and the y coordinates of the starting/ending
positions of vertical lines are divided by the maximal height. As a result, all
the elements in a feature vector are normalized between 0 and 1.
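A minimal sketch of this normalization is given below. The tuple layouts, `(x, y_start, y_end)` for vertical lines and `(y, x_start, x_end)` for horizontal lines, and both function names are assumptions made for illustration.

```python
from typing import List, Tuple

Line = Tuple[float, float, float]


def normalize_vertical(lines: List[Line], max_width: float, max_height: float) -> List[Line]:
    """Scale vertical lines given as (x, y_start, y_end) into [0, 1]."""
    # x is divided by the maximal width, the y end points by the maximal height.
    return [(x / max_width, ys / max_height, ye / max_height) for x, ys, ye in lines]


def normalize_horizontal(lines: List[Line], max_width: float, max_height: float) -> List[Line]:
    """Scale horizontal lines given as (y, x_start, x_end) into [0, 1]."""
    # y is divided by the maximal height, the x end points by the maximal width.
    return [(y / max_height, xs / max_width, xe / max_width) for y, xs, xe in lines]


# Example with a hypothetical maximal area size of 100 x 400 pixels.
print(normalize_vertical([(50.0, 0.0, 200.0)], max_width=100.0, max_height=400.0))
```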



4 Form Classification 

4.1 Calculation of Disparity Using DP Matching 

The next step is to select the matching areas from the partitioned local areas.
The selected matching areas should satisfy the following conditions. The
matching areas should be selected so as to classify an input form effectively.
More specifically, the matching areas must allow the input form to be classified
uniquely in a reasonable span of time. It is desirable to use a small number of
matching areas that have large geometric differences.

When performing form classification using line information as a feature,
the following problems must be considered: (1) Noise similar to a line could be
added. (2) Lines could disappear for unknown reasons. (3) A line can be broken
into two or more line segments. (4) A line can partially disappear. Computing
the disparity by the DP matching method offers a good solution to these
kinds of problems.

The disparity represents the distance between the layout structures of two form
documents in one partitioned area (see Fig. 3). If only two form documents exist
and each form document is partitioned into $n$ areas, then $n$ disparity values
are computed. If three form documents (for example A, B, and C) exist, the
possible pairs are AB, BC, and CA, which means a total of $3n$ disparity values
can be computed. A disparity plane is generated by comparing two form documents
and contains $n$ disparity values. A disparity vector is constructed from all the
disparity values of the corresponding area in all disparity planes (see Fig. 3).




[Figure 3 labels: disparity plane; disparity vector $d = (d_1, d_2, \ldots, d_n)$]

Fig. 3. Partitioned areas and disparity vectors for the local areas

The computation is summarized as follows. First, the position and the
starting/ending points of the lines in each partitioned local area are extracted.
Next, the values are normalized. A feature vector is composed of the normalized
relative positions of the lines in the particular partitioned area. At this
point, the disparity can be computed using the DP matching algorithm, which
is defined as follows:



$g(i, j) = \min \{\, g(i-1, j) + C,\; g(i-1, j-1) + d(i, j),\; g(i, j-1) + C \,\}$   (1)


where $i$ and $j$ are indices into the two respective feature vectors, and $C$ is
a constant of DP matching. The weights in a weighted graph are computed by
$g(i, j)$. If the numbers of elements in the two feature vectors are $m$ and $n$,
then $1 \le i \le m$ and $1 \le j \le n$. For example, if two form documents are
to be processed and the two feature vectors for an area are
$((a_{l1}, a_{s1}, a_{e1}), (a_{l2}, a_{s2}, a_{e2}), \ldots, (a_{lm}, a_{sm}, a_{em}))$ and
$((b_{l1}, b_{s1}, b_{e1}), (b_{l2}, b_{s2}, b_{e2}), \ldots, (b_{ln}, b_{sn}, b_{en}))$
respectively, then $d(i, j)$ is defined as follows:

$d(i, j) = |a_{li} - b_{lj}| + \alpha (|a_{si} - b_{sj}| + |a_{ei} - b_{ej}|)$   (2)



In the DP matching of line segments, the distance between two matching lines is
added to the measure. In detail, the position and the length of a line are
determined from the location and starting/ending points of the two lines and
added to the measure. Here, $\alpha$ is a constant that indicates the extent to
which the distance between the starting/ending points is reflected in the
equation. As a result of the DP matching, a weighted graph is generated.
In the weighted graph, the shortest path $k_1, k_2, \ldots, k_Q$ satisfying the
following condition is found.

disparity(d)        (3)

where $w(i)$ is a function that returns the weight of the $k_i$ node. The path
connects the $(0, 0)$ node and the $(m, n)$ node in the weighted graph with a
minimal penalty.
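The DP matching step can be sketched as follows. This is an illustration and not the authors' implementation. It assumes a standard three-way recurrence (skip a line in either feature vector at cost $C$, or match two lines at cost $d(i, j)$ from Eq. 2), and it assumes that the disparity is the total cost of the cheapest path from node $(0, 0)$ to node $(m, n)$. The default values of `C` and `alpha` and the function names are hypothetical.

```python
from typing import Sequence, Tuple

Line = Tuple[float, float, float]  # (location, start, end), normalized to [0, 1]


def line_distance(a: Line, b: Line, alpha: float) -> float:
    """d(i, j) of Eq. 2: location gap plus alpha-weighted end point gaps."""
    return abs(a[0] - b[0]) + alpha * (abs(a[1] - b[1]) + abs(a[2] - b[2]))


def dp_disparity(A: Sequence[Line], B: Sequence[Line],
                 C: float = 0.5, alpha: float = 0.5) -> float:
    """Cost of the cheapest path from node (0, 0) to node (m, n)."""
    m, n = len(A), len(B)
    g = [[0.0] * (n + 1) for _ in range(m + 1)]
    # Border cells: every unmatched line costs the constant C.
    for i in range(1, m + 1):
        g[i][0] = g[i - 1][0] + C
    for j in range(1, n + 1):
        g[0][j] = g[0][j - 1] + C
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            g[i][j] = min(
                g[i - 1][j] + C,  # line i of A has no partner
                g[i - 1][j - 1] + line_distance(A[i - 1], B[j - 1], alpha),
                g[i][j - 1] + C,  # line j of B has no partner
            )
    return g[m][n]


area_a = [(0.2, 0.0, 1.0), (0.8, 0.0, 1.0)]
area_b = [(0.2, 0.0, 1.0)]  # the second line has disappeared
print(dp_disparity(area_a, area_b))
```

Because a missing or extra line only adds the constant `C`, a noisy line or a vanished line changes the result by a bounded amount, which is the robustness property motivated in the text.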

If the number of form documents is $n$, a total of $_nC_2$ pairs of form
documents exist, and the following disparity vector $d$ can be obtained for each
matching area (see Fig. 3). If the number of matching areas is $m$, then a total
of $m$ disparity vectors can be obtained.

$d = (d_1, d_2, \ldots, d_{_nC_2})$   (4)



4.2 Selection of Matching Area 

Next, matching areas are selected by using the disparity vectors. To select
matching areas, all the elements of each disparity vector, that is, the disparity
values obtained by comparing the two local areas in each pair of forms, can be
considered. An important criterion is the recognition rate. Since the DP matching
algorithm is used as the recognizer, the disparity affects the recognition rate.
Therefore, an area whose disparity vector has large disparity values can be
selected as a matching area. Several strategies for selecting the matching areas
can be considered. We suggest the following three methods.

Largest Score First (LSF) Method. The simplest method is to use the average
of all the elements (disparities) of each disparity vector as a score for
selecting the matching areas by which the input forms will be classified. That
is, the local area with the largest score is selected as a matching area. The
score is computed as follows:

$s_i = \frac{1}{_nC_2} \sum_{j=1}^{_nC_2} d_{ij}, \quad 1 \le i \le m, \; 1 \le j \le {_nC_2}$   (5)

Here $s_i$ is the score of the $i$th area, $d_{ij}$ is the $j$th element of its
disparity vector, and $0 \le s_i \le 1$. For example, if there are four disparity
vectors and the resulting scores are as shown in Table 1, the first selection is
$a_2$. When the number of matching areas is 2, $a_2$ and $a_1$ are selected by
the score. In this method, the order of selection is $a_2, a_1, a_3, a_4$ (see
Table 1), and only the selected areas are used in the form classification phase.
This method is called the Largest Score First method.
Table 1. An example of the disparity values and scores.

area    AB     BC     CA     Score
$a_1$   0.50   1.00   0      0.50
$a_2$   1.00   0.80   0.13   0.64
$a_3$   0.50   0      0.88   0.46
$a_4$   0      0.20   1.00   0.40
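
The scores in Table 1 can be reproduced with a short sketch. It assumes only that the score of an area is the average of its disparity values; the dictionary layout and the function names are hypothetical.

```python
from typing import Dict, List

# Disparity values of Table 1, in the pair order AB, BC, CA.
TABLE_1: Dict[str, List[float]] = {
    "a1": [0.50, 1.00, 0.00],
    "a2": [1.00, 0.80, 0.13],
    "a3": [0.50, 0.00, 0.88],
    "a4": [0.00, 0.20, 1.00],
}


def lsf_scores(disparity: Dict[str, List[float]]) -> Dict[str, float]:
    """Score of each area: the average of its disparity vector."""
    return {area: sum(values) / len(values) for area, values in disparity.items()}


def lsf_order(disparity: Dict[str, List[float]]) -> List[str]:
    """Areas sorted by decreasing score (Largest Score First)."""
    scores = lsf_scores(disparity)
    return sorted(scores, key=scores.get, reverse=True)


print(lsf_order(TABLE_1))
```

The resulting order is a2, a1, a3, a4, which agrees with the order of selection stated for the LSF method.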



Maximal Disparity First (MDF) Method. As shown in Table 1, the area $a_2$ has
three disparity values, 1.00, 0.80, and 0.13, for the AB, BC, and CA pairs of
form documents respectively. This means that forms A and B, as well as forms B
and C, can be distinguished well in this local area, but forms C and A cannot.
Therefore, if we select $a_2$ as the first matching area, the next matching area
can be selected so as to enhance the recognition of the CA pair. In this case,
area $a_4$ can be selected because forms C and A are distinguished more clearly
in that area. Hence, the order of selection is $a_2, a_4, a_1, a_3$ by this method.
The method is called the Maximal Disparity First method.
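
One way to express this selection as a greedy procedure is sketched below. It is an interpretation and not the authors' stated algorithm: it assumes that a pair of forms is "covered" by the largest disparity it has in any area selected so far, and that each step picks the unselected area with the largest disparity for the currently weakest pair. The names are hypothetical.

```python
from typing import Dict, List

# Disparity values of Table 1, in the pair order AB, BC, CA.
TABLE_1: Dict[str, List[float]] = {
    "a1": [0.50, 1.00, 0.00],
    "a2": [1.00, 0.80, 0.13],
    "a3": [0.50, 0.00, 0.88],
    "a4": [0.00, 0.20, 1.00],
}


def mdf_order(disparity: Dict[str, List[float]]) -> List[str]:
    """Greedy Maximal Disparity First ordering (assumed interpretation)."""
    areas = list(disparity)
    n_pairs = len(disparity[areas[0]])
    # The first area is chosen by the average score, as in the LSF method.
    first = max(areas, key=lambda a: sum(disparity[a]) / n_pairs)
    order = [first]
    remaining = [a for a in areas if a != first]
    while remaining:
        # Best disparity reached so far for each pair of forms.
        coverage = [max(disparity[a][p] for a in order) for p in range(n_pairs)]
        weakest = min(range(n_pairs), key=lambda p: coverage[p])
        # Pick the area that separates the weakest pair most clearly.
        chosen = max(remaining, key=lambda a: disparity[a][weakest])
        order.append(chosen)
        remaining.remove(chosen)
    return order


print(mdf_order(TABLE_1))
```

On the values of Table 1 this procedure yields a2, a4, a1, a3, the same order as in the example above.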



Genetic Algorithm Method. To determine the optimal matching areas with
respect to the recognition rate and the computation time, we use a Genetic
Algorithm. The candidates for the matching areas are selected by using the
score mentioned previously, and then the optimal matching areas are selected
by a Genetic Algorithm. In the Genetic Algorithm, the optimal result is
produced by considering both the recognition rate and the computation time. To
compute the fitness of a gene during the evolution, the following fitness function
$f$ is used.



where $\sigma$ is a constant indicating the ratio of the recognition rate to the
computation time in the learning process, which satisfies $0 \le \sigma \le 1$.
The recognition rate obtained by a gene is denoted by $r$, and $t$ stands for the
average computation time to classify an input form. The maximal computation time
to classify an input form by the genes is denoted by $T$.



5 Experiments and Results 

5.1 Experimental Environment and Contents 

We created a system in order to test the proposed method. Our system was
implemented on a Pentium PC (PII 366) using the C++ language. In this experiment,
a total of six types of credit card slips were used because they have a similar
size and/or structure and have the same function. More specifically, the slips of
Hankook, Union, E-Mart, Easy, Kiss, and Yukong were used. A total of 246 form
images, comprising six training forms and 240 test forms, were scanned in
200 dpi mode. The average size of the images of the experimental form documents
was 826 × 1139 pixels.

5.2 Partition of Forms and Computation of Disparity Scores 

In the form partition phase, the vertical and horizontal separators are
obtained. First, the partition thresholds, $d_h$ and $d_v$, are defined based on
experimental results. Here $d_h$ stands for the distance between two neighboring
vertical lines and starting/ending points of the horizontal lines, and $d_v$
stands for the distance between two neighboring horizontal lines and
starting/ending points of the vertical lines.




Fig. 4. Partition of forms and disparity score in each region 

Fig. 4 shows the result of the partition when $d_h/d_v$ is 7/7, 8/8, 10/10, 12/12
and 13/13 respectively, values which were decided by experiments. Fig. 4 also
shows the disparity scores. A white area means that the disparity score in the
area is close to 0, and a black area means that the disparity score is close to 1.
A total of 2583 areas are obtained if all of the vertical and horizontal lines
that are apart from each other are separated. When $d_h/d_v$ is 13/13, 432 areas
are generated.

Meanwhile, the computation time to partition a form is long when the threshold
is small because the number of areas is large. Conversely, the computation
time is short when the matching areas are large, because the number of matching
areas decreases. However, the DP matching time increases again when the size
of a matching area exceeds a certain value, due to the increased number of
lines in a partitioned local area.

5.3 Form Classification by an LSF Method

Before computing the disparity for the corresponding parts in each pair of
forms, the form documents were partitioned with the thresholds $d_h/d_v$ set to
7/7, 8/8, 10/10, 12/12, and 13/13, which were decided by experiments. All the
disparity values in the corresponding planes were used as elements of the
disparity vector. The score was computed as the average of all the elements in
the disparity vector. Finally, the 15 areas with the highest scores were selected
as matching areas, and the form classification was performed for each number of
matching areas. Fig. 5 shows the recognition results for the five partition
thresholds.







Fig. 5. The recognition rate according to the number of matching areas 

Generally, the recognition rate is high if the size of a matching area is large 
and low if the size of matching area is small. Fig. 5 shows that the recognition 
rate is 95.29% in the case that the threshold is 7/7 with 15 matching areas. 

Fig. 5 shows that the second, fourth, and sixth matching areas increase the
error rate although the scores of those areas are high, because those areas are
filled with data or pre-printed text. Fig. 5 also shows that the recognition rate
does not easily reach 100% even if more matching areas are used, because the
previous matching areas that caused the errors are still in use. The recognition
rate decreases when the second matching area is used because the matching area
is too small and the filled-in data are incorrectly recognized as lines. The
filled-in data in the 5th and 6th areas, which were spots for signatures, were
also recognized as lines.

It was deduced from these experimental results that some matching areas can
cause poor results despite their high scores. This is especially true when the
matching area is relatively small and when filled-in data such as signatures
are entered. In contrast, the system can overcome noise-induced errors if the
matching area is relatively large, although the computation time can then be
long.

5.4 Form Classification by an MDF Method

Another method of selecting matching areas, called the MDF, was examined in 
the study. In this approach, the matching areas are selected so as to classify the 
form document that could be confused with already existing matching areas. 
Fig. 6 shows the recognition result. 

Fig. 6. Recognition rate according to the number of matching areas

The first area is selected by the score, and the other matching areas are
selected by the Maximal Disparity First method. As shown in Fig. 6, the result
of the MDF method is different from that of the previous method. When the
score was used, the recognition rate fluctuated significantly as the number of
matching areas increased. This is because matching areas with superior
classifying abilities for the mis-recognized forms were not selected
beforehand. In contrast, the Maximal Disparity First method shows a small change
in the recognition rate and generally a better recognition rate than the
score-based method (compare Fig. 5 and Fig. 6). In both methods the recognition
rate increases with the number of matching areas. When the partition threshold
is 13/13, a 100% recognition rate is achieved by using only 9 matching areas,
which cover 4% of the total size of a form image.

There are several explanations for the recognition errors. (1) Erroneous lines
can be extracted from images of poor quality. (2) Preprinted characters or
user-entered data can be extracted as lines if a matching area is too small.
(3) Two or more form documents can be mis-recognized in a specific matching area
due to similar form structures. To solve these problems we suggest the selection
method with GA.



5.5 Selection of the Optimal Matching Areas with GA 

In this method, the candidates for the matching areas are selected by the score,
and then the optimal matching areas are selected automatically with GA. The
first step is to remove the redundant matching areas, and a total of 40
candidates are selected by the score. We could ascertain from the previous
experiments that fewer than 20 matching areas are needed to classify the input
form document uniquely. A gene, therefore, consists of 20 bits, representing a
range from a maximum of 20 matching areas to a minimum of 1 area.

In the GA operation, we used a 50% rate of selection, a 0.1% rate of crossover,
and a 0.005% rate of mutation. The fitness is computed from the recognition rate
and the computation time, so the fitness is large when the recognition rate
is high and the computation time is short. The fitness converges to a certain
value as the generations proceed. As a result of form recognition, the rate
converges to 100% during the evolution process for all of the partition
thresholds. The speed of convergence can be either slow or fast. It becomes slow
when spurious lines from noise or filled-in data are extracted and cause the
input to be recognized incorrectly with a high error rate. On the other hand,
when the error rate due to incorrect recognition is small, the convergence
becomes fast.




Fig. 7. The fitness of genes during the learning process 

To select the optimal matching areas, the following steps were performed.
First, a total of 30 filled-in images (5 images for each type of form) were
selected and used in the learning process. As a result of the process, the genes
with a high fitness remain alive at a certain generation. Next, form classification
was performed by using the matching areas represented by all of the genes alive.
Figure 7 shows the fitness computed when $\sigma$ is 0.8. During the evolution
process, the fitness value converges to 0.962 after the eighth generation.

The recognition rate was measured as the average of the recognition rates of
all the genes in a generation. Figure 8 shows that the recognition rate
converges to a certain value during the learning process. In this case, a
100% recognition rate was acquired after only the eighth generation. It is
interesting that the fitness at the second generation is not low in Figure 7
although the recognition rate at the same generation in Figure 8 is low. This is
because the computation time needed to extract features and to classify the input
form is short. Consequently, we achieved a 100% rate of recognition after the
18th generation, when the average number of matching areas represented by the
genes was 5.2. The average time required to recognize a form was 0.76 seconds
when these matching areas were used for matching.

Fig. 8. The rate of recognition according to the generation

6 Conclusion



In this paper, we proposed a new method for processing form documents efficiently
using partial matching. Our method provides feasible approaches for defining the
matching areas. Redundant local areas and areas containing filled-in data or
noise are not selected, so that a good feature vector can be extracted with
respect to the recognition rate and the computation time. Searching and matching
only a small number of structurally distinctive local areas yields a high rate of
classification with a reduced processing time.

From the experiments discussed in the previous section, the following was
found. By using areas with large differences in structural information among the
form documents to be processed, a good feature can be extracted and used as
the input of the classifier, which enables the system to process an input form
document with a high recognition rate within a reasonable span of time.
Moreover, the optimal matching areas can be selected by using a Genetic
Algorithm, which enables the form processing system to process an input form in
a reasonable time span. As a result, redundant matching areas are not processed,
a feature vector of good quality can be extracted quickly, and an efficient form
processing system that is applicable in a real environment can be constructed.
Abstract. This paper investigates the use, for the task of classifier
learning in the presence of misclassification costs, of some gradient de- 
scent style leveraging approaches to classifier learning: Schapire and 
Singer’s AdaBoost.MH and AdaBoost.MR [16], and Collins et al’s multi- 
class logistic regression method [4], and some modifications that retain 
the gradient descent style approach. Decision trees and stumps are used 
as the underlying base classifiers, learned from modified versions of Quin- 
lan’s C4.5 [15]. Experiments are reported comparing the performance, in 
terms of average cost, of the modified methods to that of the originals, 
and to the previously suggested “Cost Boosting” methods of Ting and 
Zheng [21] and Ting [18], which also use decision trees based upon modified
C4.5 code, but do not have an interpretation in the gradient descent
framework. While some of the modifications improve upon the originals 
in terms of cost performance for both trees and stumps, the compari- 
son with tree-based Cost Boosting suggests that out of the methods first 
experimented with here, it is one based on stumps that has the most 
promise. 



1 Introduction 

Much work within the field of machine learning focusses on methods for learning 
classifiers for attribute value data. The methods learn classifiers from examples 
of known class with attribute value descriptions, attempting to predict well the 
class of new examples from their attribute value descriptions. Although the most 
common goal is to learn classifiers with high accuracy, in which case all mistakes 
are considered equally bad, mistakes can have different degrees of significance, 
e.g. for the owner of a new car, paying a year’s theft insurance for a year in which 
the car is not stolen will (usually!) be a less expensive mistake, than not paying 
for the insurance in a year in which the car is stolen and not recovered. Thus some 
recent work has considered the issue of there being different misclassification 
costs associated with the different ways of misclassifying, e.g. [6] and [13]. In 
some circumstances other forms of cost may be worth taking into account, see
e.g. [22], but that is not pursued here, and henceforth costs will be assumed to 
be misclassification costs. 

Recently, in the cost context there has been interest in the approach of 
learning classifiers that use the predictions of many component classifiers, e.g. 
Breiman’s bagging [1] has been used for such problems in work involving the 
author [3], as have modifications of Freund and Schapire’s boosting [9] by Fan et 
al [8], previously by Ting and Zheng [21], and subsequently by Ting individually 
[17,18], and Ting and Witten’s form of Wolpert’s stacking [19,20,23] in work also 
involving the author [2]. This paper continues the theme of combined classifiers
and costs, taking a further look at the boosting-style approaches, considering 
methods which are applicable to problems with two or more classes, not just 
two classes as considered in e.g. [8,17]. In common with much of the previous 
work, the paper assumes a misclassification cost matrix representation of costs: 
for each class, there is a positive cost for each other class, representing the cost 
of misclassifying an item of the first class as being of the second class. The cost 
of correct classification is zero. The matrix is assumed to be available at the time 
the classifier is learned, but many of the methods here could be used in circum- 
stances with differing costs per item, and only some need the cost information 
at learning time. 
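
As a small illustration of the cost matrix representation, the sketch below computes the average misclassification cost of a set of predictions. The two class names and the cost values are hypothetical and merely echo the insurance example above; only the structure (zero cost for correct classification, a positive cost for each other class) follows the text.

```python
from typing import Dict, List, Tuple

# Hypothetical cost matrix, indexed as COST[true_class][predicted_class].
COST: Dict[str, Dict[str, float]] = {
    "stolen": {"stolen": 0.0, "not_stolen": 10.0},
    "not_stolen": {"stolen": 1.0, "not_stolen": 0.0},
}


def average_cost(pairs: List[Tuple[str, str]],
                 cost: Dict[str, Dict[str, float]]) -> float:
    """Mean cost over (true_class, predicted_class) pairs."""
    return sum(cost[true][predicted] for true, predicted in pairs) / len(pairs)


predictions = [
    ("stolen", "stolen"),
    ("stolen", "not_stolen"),
    ("not_stolen", "stolen"),
    ("not_stolen", "not_stolen"),
]
print(average_cost(predictions, COST))
```

With these hypothetical values, two mistakes out of four predictions give an average cost of 2.75, although a plain error rate would treat both mistakes as equally bad.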

The original Adaboost method [9] and many of the subsequent variations 
aimed at accuracy maximisation have been interpreted as a form of gradient de- 
scent minimisation of a potential function, as in e.g. [7]; however, the previously 
successful boosting-style methods applicable to problems with misclassification 
costs and possibly more than two classes [21,18], do not appear to be able to 
be interpreted in such a manner. This paper investigates experimentally the 
misclassification cost performance of some boosting-style methods previously 
proposed outside the cost context, and suggests some variations on them for the 
cost context while retaining the notion of the gradient of a potential function. 
The previous work on boosting-style approaches in the presence of costs has 
combined the predictions of many decision trees learned using (modified ver- 
sions of) C4.5 [15], and this paper follows this, basing the underlying learner on 
C4.5, either as in the previous work, growing (and pruning with C4.5’s standard 
criterion) a full tree, or restricting the tree grown to be a “stump”, a tree with
only one decision node and the leaves immediately below it. While the use of 
stumps has been investigated in the accuracy context [16], it has not previously 
been considered for problems with costs. 

As Duffy and Helmbold remark [7], not all potential functions that have 
been used in variations on Adaboost lead to the formal PAC boosting property, 
and they use the term “leveraging” for the broader category of methods that in 
some sense leverage an underlying classifier learning method, hence our use of 
the term. However, where other authors have described their own method as a 
“boosting” method, even where it lacks the formal property, we may also use 
the term boosting. 
The paper continues with a description of the methods considered, then the
experiments and results, including a comparison with the previous methods, and 
finishes with some conclusions and suggestions for further work. 



2 The Methods 

This section describes the leveraging methods and variations that will be com- 
pared experimentally for misclassification cost performance. First the general 
framework will be outlined, then the specific methods within it. 



2.1 General Framework

The general framework to be outlined here is based on the multiple class methods
of Schapire and Singer [16], but restricted to the case where each item can have
only one class. It leverages a learning method that will take a set of $m$
instances, each of which has an attribute value description, a class label $y$ in
the set of possible labels $L$, and a set of weights, one for each possible class
label, and return a classifier, a hypothesis $h$, which when given the description
of an instance $i$ and a class label $l$ returns a prediction $h(i, l)$; in some
cases this may be a +1/-1 prediction, in others a real valued prediction. Each
such hypothesis is given a weight $\alpha$; in some cases $\alpha = 1$.

If a series of classifiers, $h_1, h_2, \ldots, h_t$, have been formed, the sum of
the weighted classifier predictions for item $i$, label $l$, is
$s(i, l) = \sum_{j=1}^{t} \alpha_j h_j(i, l)$.

For a label $l$, the “margin” is $[y = l]\, s(i, l)$, where $[\ldots]$ is used to
stand for +1/-1 for the enclosed expression being true / false respectively. Thus
a positive margin for a label corresponds to a correctly signed $s(i, l)$,
thresholding about zero.
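
The two definitions can be written out directly. The sketch below is illustrative only: hypotheses are modelled as plain Python callables taking an item index and a label, and the two toy hypotheses and their weights are hypothetical.

```python
from typing import Callable, Sequence

Hypothesis = Callable[[int, str], float]  # h(i, l)


def combined_score(hypotheses: Sequence[Hypothesis], alphas: Sequence[float],
                   i: int, label: str) -> float:
    """s(i, l): the weighted sum of the predictions for item i and label l."""
    return sum(a * h(i, label) for a, h in zip(alphas, hypotheses))


def margin(true_label: str, label: str, score: float) -> float:
    """[y = l] s(i, l), with [.] equal to +1 if true and -1 if false."""
    sign = 1.0 if true_label == label else -1.0
    return sign * score


# Two toy hypotheses for a single item whose true label is "a".
def h1(i: int, l: str) -> float:
    return 1.0 if l == "a" else -1.0


def h2(i: int, l: str) -> float:
    return 0.25


alphas = [1.0, 2.0]
for l in ("a", "b"):
    s = combined_score([h1, h2], alphas, 0, l)
    print(l, s, margin("a", l, s))
```

Both margins are positive here: the score for the true label "a" is positive and the score for the other label "b" is negative, so both are correctly signed.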

The “gradient descent” view is based upon consideration of a potential func- 
tion expressing the current extent of training error as a function of the margins of 
the training examples. The leveraging process attempts to minimise the function 
in a series of steps, in each of which it attempts to learn an underlying classifier 
that approximates the direction of the (negative) gradient with respect to the 
margins, then takes an appropriately sized step in that direction by adding the 
classifier’s predictions (perhaps appropriately weighted) to the current combined 
classifier. (Note that some might consider the gradient descent perspective more 
specifically to apply to the two class case.) 

Given a potential function defined in terms of the $s(i, l)$ in such a way that
the gradient of the potential function with respect to the margins can be found,
the general leveraging process here consists of repeating for a number of rounds,
the following steps for the $j$th round:

1. Set the weight for each item label pair $(i, l)$ to be the negative of the
gradient of the potential with respect to the margin of $(i, l)$

2. Normalise the weights so that the sum of the weights is the number of items

3. Learn $h_j$ using modified C4.5

4. Calculate $\alpha_j$
When the learning process is completed the resulting combined classifier can
be used to predict the class of a new item from the s’s for the new item’s different 
labels. 
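
The four steps of a round can be sketched as a generic loop. This is a schematic illustration under several assumptions: the potential is taken to be the sum of exp(-margin) over item label pairs, so that the negative gradient with respect to a margin is exp(-margin); the base learner and the step size rule are passed in as callables (`learn` and `step_size`, both hypothetical stand-ins for the modified C4.5 learner and the method-specific weight calculation).

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Hypothesis = Callable[[int, str], float]  # h(i, l)
Weights = Dict[Tuple[int, str], float]


def leverage(y: Sequence[str], labels: Sequence[str],
             learn: Callable[[Weights], Hypothesis],
             step_size: Callable[[Hypothesis, Weights], float],
             rounds: int) -> Tuple[List[Tuple[float, Hypothesis]], Weights]:
    """Run the generic leveraging loop and return the ensemble and s(i, l)."""
    m = len(y)
    s: Weights = {(i, l): 0.0 for i in range(m) for l in labels}
    ensemble: List[Tuple[float, Hypothesis]] = []
    for _ in range(rounds):
        # Step 1: negative gradient of exp(-margin) w.r.t. the margin.
        weights = {
            (i, l): math.exp(-(1.0 if y[i] == l else -1.0) * s[(i, l)])
            for (i, l) in s
        }
        # Step 2: normalise so that the weights sum to the number of items.
        total = sum(weights.values())
        weights = {key: w * m / total for key, w in weights.items()}
        # Step 3: learn the next hypothesis (stand-in for modified C4.5).
        h = learn(weights)
        # Step 4: calculate the hypothesis weight alpha_j.
        alpha = step_size(h, weights)
        ensemble.append((alpha, h))
        for (i, l) in s:
            s[(i, l)] += alpha * h(i, l)
    return ensemble, s


# Toy run: a stub learner that always returns the correct +1/-1 prediction.
y = ["a", "b"]
labels = ["a", "b"]
ensemble, s = leverage(
    y, labels,
    learn=lambda w: (lambda i, l: 1.0 if y[i] == l else -1.0),
    step_size=lambda h, w: 1.0,
    rounds=2,
)
print(s)
```

The stub learner ignores the weights, so the example only demonstrates the bookkeeping of the loop; a real base learner would use the weights to focus on the item label pairs with small or negative margins.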

The differences between the methods that are now to be described further are
in the potential functions, the modification to the C4.5 learning and predicting
process, the calculation of $\alpha_j$, and the way that the $s$’s are used to
predict the label of a new item. These are now described for each of the methods
that we consider.



2.2 Schapire and Singer’s Multi-class AdaBoosts and Variations 

The AdaBoost.MH and AdaBoost.MR methods of Schapire and Singer [16] have 
not previously been tested in the cost context, and they and some modifications 
aimed at the cost context are the main methods examined here. Both meth- 
ods follow the previous boosting approaches in using exponential style potential 
functions, leading to exponential gradients and instance-label pair weights that 
can be updated simply (e.g. multiplicatively) in the implementation rather than 
calculating the gradient from scratch as the general framework might suggest. 
The original paper gives a fuller description than that here, though it is not 
approached from the gradient descent perspective. 

The original work evaluated the methods using decision stumps of a slightly 
different form to the C4.5 stumps we consider here. For example, when testing a 
multiple-valued discrete attribute, the stumps here form one leaf per value and use 
the C4.5 approach to handling missing values (splitting items amongst branches), 
whereas the original work formed a stump with three leaves: one for a value 
chosen to test against, another for all other known values, and the last for missing 
values. This work also considers full trees, not just the stumps used by Schapire 
and Singer in the original work. 



AdaBoost.MH. AdaBoost.MH (Multi-class Hamming) is based upon the potential 
function $\sum_{i=1}^{m} \sum_{l \in L} \exp(-[y_i = l]\, s(i, l))$, reflecting a notion 
of Hamming style loss across instances and labels. Schapire and Singer suggest that for 
trees, each leaf can to some extent be considered as a separate classifier (making zero 
predictions on items that do not reach it) for the purposes of determining the 
appropriate prediction to make (i.e. step size to take) to minimise the potential 
function. The appropriate real valued leaf prediction for label l is 
$\frac{1}{2}\ln(W_+^l / W_-^l)$, where $W_+^l$ is the weight of items with class label l 
at the leaf and $W_-^l$ is the weight of items with other class labels at the leaf. (In 
practice, to avoid potentially infinite predictions, a small constant weight is added to 
both Ws.) Leaf predictions of this form lead to an appropriate splitting criterion based 
upon the Ws, which has been incorporated into our modified C4.5 along with the leaf 
predictions. The leaf predictions in effect render the $\alpha$s redundant and they are 
set to 1. Predictions are made by the combined classifier by predicting the label with 
the greatest s(i, l) (with simplistic tie-break here as in subsequent methods). 

AdaBoost.MR. In the single label case AdaBoost.MR (Multi-class Ranking) 
simplifies to AdaBoost.M2 [9], but this paper sticks with the term MR. The 
method is based upon the potential function 
$\sum_{i=1}^{m} \sum_{l \in L,\, l \neq y_i} \exp(s(i, l) - s(i, y_i))$, 
reflecting a notion of ranking loss with respect to the incorrect labels versus the 
correct label for each instance. 

Schapire and Singer do not suggest an appropriate real valued leaf prediction, 
just the use of +1/-1 predictions of $[W_+^l > W_-^l]$ for label l. Leaf predictions of 
this form lead to an appropriate splitting criterion based upon locally maximising r, 
the weighted sum over instance-label pairs of the correctness of the predictions 
(+1 for a correct prediction for the instance-label pair, -1 for incorrect), and the 
splitting criterion and prediction method have been incorporated into our C4.5 
code. Here Schapire and Singer address the issue of the appropriate step size at 
the level of the entire classifier, with $\alpha$ being 
$\frac{1}{2}\ln\left(\frac{1+r}{1-r}\right)$. Predictions are made by 
the combined classifier by predicting the label with the greatest s(i, l). 
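The two step-size rules just described, the per-leaf real valued prediction for MH and the classifier-level weight for MR, can be sketched as follows. The smoothing constant, the clipping of r, and the normalisation of r by the total weight are assumptions made so the example is self-contained; the function names are hypothetical.

```python
import math
from typing import Sequence


def mh_leaf_prediction(w_plus: float, w_minus: float, smooth: float = 1e-6) -> float:
    """AdaBoost.MH leaf output 0.5 * ln(W+ / W-) for one label at one leaf.

    `smooth` is the small constant added to both weights; its value here is
    an arbitrary placeholder.
    """
    return 0.5 * math.log((w_plus + smooth) / (w_minus + smooth))


def weighted_correctness(weights: Sequence[float], correct: Sequence[bool]) -> float:
    """r: weighted sum of +1 (correct) / -1 (incorrect), scaled into [-1, 1]."""
    total = sum(weights)
    return sum(w if ok else -w for w, ok in zip(weights, correct)) / total


def mr_alpha(r: float, clip: float = 1e-9) -> float:
    """AdaBoost.MR classifier weight 0.5 * ln((1 + r) / (1 - r))."""
    r = max(-1.0 + clip, min(1.0 - clip, r))  # keep the logarithm finite
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


# Three instance-label pairs, the middle one predicted incorrectly.
r = weighted_correctness([1.0, 1.0, 2.0], [True, False, True])
alpha = mr_alpha(r)
```

A leaf with equal weight for and against a label predicts zero for it, and a classifier no better than chance (r = 0) receives zero weight.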



Non-uniform Initialisation. The AdaBoost.M methods are designed to at- 
tach equal importance to each training example, and thus take no account of 
the relative importance in cost terms of getting different items right. Previ- 
ous cost work, e.g. [12,18] has used non-uniform weight initialisation to get the 
non-uniform misclassification costs taken into account by the classifier learning 
method. In this previous work, the classifier learning has used only one weight 
per item, and hence in problems with more than two classes, the different costs 
of making different forms of errors for an individual item have not been able to 
be expressed in the weighting - each item has been weighted proportionately to 
the sum of the costs of the way in which it can be misclassified. As this method 
in effect collapses a cost matrix to a cost vector, it will be referred to as the 
“vector” approach of initialisation. 

A vector approach can be applied to the AdaBoost.M methods by weighting 
each of the terms in their potential functions by the relevant costs. Letting 
$c_i$ stand for the sum of the costs of the ways of misclassifying instance 
i, the potential functions become 
$\sum_{i=1}^{m} c_i \sum_{l \in L} \exp(-[y_i = l]\, s(i, l))$ (MH) and 
$\sum_{i=1}^{m} c_i \sum_{l \in L,\, l \neq y_i} \exp(s(i, l) - s(i, y_i))$ (MR). 
However, unlike the previous approaches, the AdaBoost.M methods offer the possibility 
of the different costs of the different ways of misclassifying the one item being 
reflected in the potential function and hence the learning process. Letting $c(i, l)$ 
stand for the cost of misclassifying instance i as label l, the potential function for 
MR becomes 
$\sum_{i=1}^{m} \sum_{l \in L,\, l \neq y_i} c(i, l) \exp(s(i, l) - s(i, y_i))$. 
While an appropriate cost weighting of the instance-label pairs with incorrect labels 
in MH seems straightforward, that for the correctly labelled pairs is less clear. Here 
we define $c(i, l)$ to be $c_i$ for the correct labels in the potential function 
$\sum_{i=1}^{m} \sum_{l \in L} c(i, l) \exp(-[y_i = l]\, s(i, l))$. 
(Alternatives for the correct label case could be investigated.) 
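To make the difference between the vector and matrix weightings concrete, the sketch below computes both sets of cost terms from a small cost matrix. It assumes the matrix is indexed as `cost[true][predicted]` with a zero diagonal; that layout and the function names are illustrative choices, not taken from the original code.

```python
from typing import List, Sequence

CostMatrix = List[List[float]]  # cost[t][p]: cost of predicting p when the truth is t


def vector_costs(cost: CostMatrix, y: Sequence[int]) -> List[float]:
    """One cost per item: the sum of the costs of all ways of misclassifying it."""
    return [sum(cost[yi]) for yi in y]


def matrix_costs_mh(cost: CostMatrix, y: Sequence[int]) -> List[List[float]]:
    """One cost per item-label pair for the MH potential.

    Incorrect labels get their own misclassification cost; the correct label
    gets the row sum, following the definition in the text.
    """
    rows = []
    for yi in y:
        row = list(cost[yi])
        row[yi] = sum(cost[yi])  # relies on the diagonal entry being zero
        rows.append(row)
    return rows


cost = [[0, 1, 4],
        [2, 0, 3],
        [5, 6, 0]]
true_labels = [0, 2]
per_item = vector_costs(cost, true_labels)
per_pair = matrix_costs_mh(cost, true_labels)
```

For the first item (true class 0) the vector approach keeps only the total 5, whereas the matrix approach retains that confusing it with class 2 is four times as costly as confusing it with class 1.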



Logistic-like Reinterpretation of the Outputs. Although there do not appear 
to have been any experiments on the idea, Freund and Schapire [9] suggested 
that the outputs of the two class AdaBoost procedure could be used in a logistic 
form as probabilistic predictions, and Friedman et al. [10] have also drawn 
the connection between logistic regression and AdaBoost. Thus the possibility 
of using the s(i, l) in a multi-class logistic style approach is considered here, 
predicting the probability of an item i being of label l as 
$\exp(s(i, l)) / \sum_{j \in L} \exp(s(i, j))$. These 
probabilistic predictions can then be used with a misclassification cost matrix to 
estimate for each class the expected cost of predicting that class, and then the 
class with least estimated expected cost can be predicted. 
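A minimal sketch of this prediction rule follows, assuming a cost matrix laid out as `cost[true][predicted]`; the function names are hypothetical and the scores are made up for illustration.

```python
import math
from typing import List, Sequence


def label_probabilities(scores: Sequence[float]) -> List[float]:
    """Multi-class logistic (softmax) reading of the combined scores s(i, l)."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]


def least_expected_cost_label(scores: Sequence[float],
                              cost: Sequence[Sequence[float]]) -> int:
    """Predict the class whose estimated expected misclassification cost is lowest."""
    p = label_probabilities(scores)
    n = len(p)
    expected = [sum(p[t] * cost[t][k] for t in range(n)) for k in range(n)]
    return min(range(n), key=lambda k: expected[k])


scores = [1.0, 0.9]                 # label 0 has the (slightly) greater score
cost = [[0, 1], [10, 0]]            # mistaking a true 1 for a 0 is expensive
chosen = least_expected_cost_label(scores, cost)
```

Here the greatest-score rule would predict label 0, but the asymmetric costs make label 1 the cheaper prediction in expectation.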



Real Leaf Prediction for MR. The real predictions from a single round for 
the MH method can potentially be different for each leaf and label, whereas the 
predictions for the MR method are just plus or minus the classifier weight $\alpha$. The 
principle by which the weight $\alpha$ was chosen at the top level can be applied instead 
at the leaf level, yielding a (potentially) different (plus or minus) prediction for 
each leaf, though this does not go so far as the MH case where there is also the 
variation by label. 



2.3 Logistic Regression Methods Based on Bregman Distances 



Schapire and Singer have, along with Collins, placed their previous work in 
a mathematical framework based around minimisation of Bregman distances 
[4], and extended the work to include some related methods based explicitly 
upon potential functions using logistic regressions. The multi-class form is based 
on the (negative) log-prob potential function 
$-\sum_{i=1}^{m} \ln\left( \frac{\exp(s(i, y_i))}{\sum_{j \in L} \exp(s(i, j))} \right)$. 
As in AdaBoost.MR, the function is based on the relative values of the s’s and, similarly 
to the situation with MR, +1/-1 predictions at leaves can be used, with the same 
form of splitting criterion and the same form of r-based weight $\alpha$; however, the 
calculation of the instance-label pair weights at each round is no longer quite 
as simple. The use of the probability estimates is straightforward, and all the 
relevant code has been incorporated into our C4.5 based implementation. 

The real leaf prediction mentioned for AdaBoost.MR can also be used with 
the logistic regression method. 
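As an illustration of the log-prob potential, the sketch below evaluates it for a table of combined scores. The function names are hypothetical, and the code only computes the potential itself, not the instance-label pair weights derived from it.

```python
import math
from typing import Sequence


def log_sum_exp(values: Sequence[float]) -> float:
    """ln(sum_j exp(values[j])), computed stably."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_prob_potential(s: Sequence[Sequence[float]], y: Sequence[int]) -> float:
    """Negative log-probability of the correct labels under the logistic reading."""
    return -sum(row[yi] - log_sum_exp(row) for row, yi in zip(s, y))


uninformed = log_prob_potential([[0.0, 0.0]], [0])   # both labels equally likely
confident = log_prob_potential([[2.0, 0.0]], [0])    # correct label favoured
```

Raising the score of the correct label relative to the others lowers the potential, which is the sense in which the function depends only on the relative values of the s’s.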



An MH-like Variation. Given the close connection between the differing 
methods, it seems natural to consider the possibility of a more MH-like form 
of the logistic regression method. A possibility examined here, based upon the 
method proposed by Collins et al. for the two class case, is to use the potential 
function 
$-\sum_{i=1}^{m} \sum_{j \in L} \ln\left( \frac{1}{1 + \exp(-[y_i = j]\, s(i, j))} \right)$, 
which is MH-like in considering separately the potential for the problems of predicting 
each of the different labels. This has been implemented in the C4.5 based code, using 
the corresponding real valued predictions from MH, and an appropriate multi-class style 
logistic prediction. 
3 Experiments 



This section presents the results of the experiments on the methods previously 
described, examining in summary the relevant comparative performance of the 
different variations, and comparing in full some of the methods against the pre- 
viously successful “Cost Boosting” methods [21,18]. 

The commercial significance of true misclassification costs is such that, 
unfortunately for work in the area, very few data sets are made publicly available 
with such costs. Hence in some of our previous work [2], and that of others, e.g. 
[21,11], publicly available data sets without given cost matrices have been used 
with a range of cost matrices generated for each data set, and the performance 
determined in terms of average misclassification cost. For two class data sets 
the alternative approaches of examining ROC curves [14] or some other forms 
of cost curves such as those proposed by Drummond and Holte [5] would give 
a better representation of the results, but the methods do not scale simply to 
problems with more than two classes. 

The 16 data sets chosen in our previous work [2] for their variety of character- 
istics and use in others’ previous relevant work are used again here. A description, 
omitted in our previous work due to lack of space, is given in Table 1. These are 
used as obtained from public sources except for the mushroom data set of which 
only a sample (10% rounded up) is used, as the full data set is uninterestingly 
straightforward for most methods. 



Table 1. Data Sets 

Name               Instances  Classes  Attributes               Missing Values 
                                       (Discrete / Continuous)  (%) 
Abalone                 4177        3  0/8                       0.0 
Colic                    368        2  15/7                     23.8 
Credit-Australian        690        2  9/6                       0.6 
Credit-German           1000        2  13/7                      0.0 
Diabetes-Pima            768        2  0/8                       0.0 
Heart-Cleveland          303        5  7/6                       0.2 
Hypothyroid             3772        4  22/7                      5.5 
LED-24                   200       10  24/0                      0.0 
Mushroom                 813        2  22/0                      1.4 
Sick-Euthyroid          3772        2  22/7                      5.5 
Sonar                    208        2  0/60                      0.0 
Soybean                  683       19  35/0                      9.8 
Splice                  3190        3  60/0                      0.0 
Tumor                    339       22  17/0                      3.9 
Vowel                    990       11  3/10                      0.0 
Waveform-40              300        3  0/40                      0.0 
As in the previous work, the random cost matrices are generated, like those 
of [21], with zero diagonal elements (as the cost of correct classifications is 
zero); one off-diagonal element is chosen at random to be 1, and the rest are then 
chosen uniformly from the integers 1 to 10. 
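The generation scheme can be sketched as below. The function name and the use of a seeded `random.Random` instance are assumptions made for reproducibility of the example; the original generator is not described beyond the rules stated above.

```python
import random
from typing import List


def random_cost_matrix(n_classes: int, rng: random.Random) -> List[List[int]]:
    """Zero diagonal, off-diagonal costs uniform on 1..10, one of them forced to 1."""
    if n_classes < 2:
        raise ValueError("need at least two classes")
    cost = [[0 if t == p else rng.randint(1, 10) for p in range(n_classes)]
            for t in range(n_classes)]
    off_diagonal = [(t, p) for t in range(n_classes)
                    for p in range(n_classes) if t != p]
    t, p = rng.choice(off_diagonal)  # the element fixed at the minimum cost
    cost[t][p] = 1
    return cost


matrix = random_cost_matrix(4, random.Random(0))
```

Fixing one element at 1 guarantees that every generated matrix contains the minimum cost, so matrices differ in the spread of costs rather than in overall scale.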

Each average cost reported is an average over ten experiments. Each experi- 
ment consists of randomly generating a cost matrix and determining the perfor- 
mance of each learning method using a ten-fold cross validation. The same cost 
matrices and splits of the data for the cross validation were used for all learning 
methods, hence the results for any two learning methods on a data set can be 
thought of as paired results, in that only the learning methods differed. 

When using leveraging methods, the issue of how many classifiers should be 
combined arises, as performance will often improve to a point and then deteriorate 
with over-fitting. Most of the previous work with trees suggests that the significant 
performance benefits arise in about the first 10 rounds. Here we run the 
methods for 30 rounds with trees, choosing the round on which to make a prediction 
on the basis of an internal hold-out procedure. One third of the training data is 
kept aside and the method is run on the remaining two thirds for 30 rounds, 
evaluating the performance at each round on the held-out third. The method 
is then run on all the training data for 30 rounds, with the prediction being made at 
the round that appeared best on the basis of the internal hold-out procedure. 
The round that is used may vary between different forms of prediction: 
e.g. although we do not report the results here, we measured the accuracy 
performance of the original methods, and the best round for accuracy performance 
determined by the internal hold-out procedure may well be different from that 
for cost performance of the original method, which again may differ from that 
for cost performance with a multi-class logistic style probabilistic prediction, etc. 
For the stumps a similar approach was used, but the methods were run for 60 
rounds, as the greater simplicity of stumps can make more rounds appropriate. 
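The hold-out selection of the prediction round can be outlined as follows. This is a sketch only: the function names are hypothetical, the split is by simple shuffling, and breaking ties in favour of the earliest round is an assumption rather than something stated in the text.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_holdout(items: Sequence[T], rng: random.Random) -> Tuple[List[T], List[T]]:
    """Keep one third of the training items aside; return (two thirds, one third)."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_hold = len(shuffled) // 3
    return shuffled[n_hold:], shuffled[:n_hold]


def best_round(holdout_costs: Sequence[float]) -> int:
    """1-based round with the lowest hold-out cost (earliest round on ties)."""
    best = min(range(len(holdout_costs)), key=lambda j: holdout_costs[j])
    return best + 1


train, held_out = split_holdout(range(9), random.Random(1))
# Hold-out cost measured after each of four rounds on the two-thirds model.
chosen_round = best_round([0.9, 0.5, 0.5, 0.7])
```

The final model would then be trained on all of the training data for the full number of rounds, with predictions read off at `chosen_round`; a separate round can be chosen for each form of prediction by passing a different list of hold-out measurements.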

To reduce the number of results to be presented to more manageable propor- 
tions, the results of many of the comparisons between pairs of similar methods 
will be considered here simply in summary form as win-lose-tie performances, 
i.e. on how many data sets one was better, on how many the other was better, 
and on how many they tied (to the accuracy to which we later report fuller 
figures). 

The first issue that we examine is the use of the cost based weighting with 
the AdaBoost.M methods. We compare the vector style weighting with the raw 
unweighted methods, and the matrix style weighting with the vector style. The 
aggregate results over the MH and MR methods and over trees and stumps for 
each of these, i.e. 4 results for each dataset, show vector strongly improving upon 
raw by 56-4-4. The corresponding figures for the comparison between matrix and 
vector, ignoring the two class data sets, on which the methods are the same, are 
28-6-2, fairly strongly supporting the view that the multiple weights per instance 
of the methods can be advantageous for the cost based weighting approach. (The 
artificial LED and waveform data sets cause most of the wins for the vector 
method, but it is not clear whether there is something interesting in this.) 
The second issue that we examine is the use of the logistic style probabilistic 
predictions to make least expected cost predictions with the AdaBoost.M meth- 
ods, instead of the simple original approach to predicting. Aggregating again 
over MR and MH and trees and stumps, shows the probabilistic predictions 
ahead by 58-3-3, a strong indication of improvement. 

Given the two previous results, the question of which improvement upon the 
base approaches is the better, the matrix based weighting or the logistic style 
probabilistic predictions, arises. Aggregating over MR and MH and trees and 
stumps suggests some overall advantage to the logistic style predictions 36-23-5. 

Using this logistic style prediction, there appears to be no genuine advantage 
to the real leaf prediction instead of the +1/-1 for the tree and stump MR 
approaches, ahead only 17-15-0. 

A similar comparison of the explicit logistic regression method with 
AdaBoost.MR suggests a possible small advantage, 21-10-1, though as this is 
composed of 14-1-1 for trees and 7-9-0 for stumps, the result may depend on the 
underlying classifier. 

The use of the real leaf prediction with the logistic regression method is 
perhaps slightly ahead, 20-10-2, of the +1/-1, and the use of the more MH style 
potential function yields very similar overall performance to the original, 16-15-1 
by comparison. 

Thus overall the clear advantages by comparison against similar methods 
are to the matrix weighting method over the vector, and the use of logistic-style 
predictions for the AdaBoost.M methods instead of the simpler original method. 
The other approaches may be worth evaluating in individual applications, but 
do not appear to be major improvements overall. 

The final question is how the methods compare against the previously proposed 
successful “Cost Boosting” methods of Ting and Zheng [21] and Ting 
[18]. Here we give the full results for some of the previously compared methods, 
focussing in terms of AdaBoost on the MH method, which was originally 
suggested to be slightly better overall by Schapire and Singer [16]: MHMatT 
(AdaBoost.MH with matrix style weighting, using trees), MHMatS (as previous 
but stumps), MHPrT (AdaBoost.MH with logistic style probabilistic predictions, 
using trees), MHPrS (as before using stumps), LgPrT (Logistic regression with 
probabilistic predictions, using trees), LgPrS (as before using stumps), CBTZ 
(Ting and Zheng’s method), CBT (Ting’s method). Note that the “Cost Boosting” 
methods only use trees as they have not been designed to work with stumps. 
Table 2 shows the average misclassification costs per instance of these methods. 

A comparison of the methods using the full trees shows that while the cost 
matrix style weighting and logistic style probabilistic predictions have been 
shown to improve upon the basic AdaBoost.MH method, they are inferior overall 
to the previous cost boosting methods. However, the use of AdaBoost.MH 
with stumps and logistic style probabilistic predictions, while frequently producing 
very different results to the previous methods, has comparable performance 
overall to each of the previous methods, being ahead of each in 12 domains, and 
marginally in front in terms of the geometric mean of the ratios of average costs. 
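The two summary statistics quoted here can be recomputed from the MHPrS, CBTZ and CBT columns of Table 2. The sketch below copies those three columns; the helper names are hypothetical, and comparisons are made at the three decimal places reported in the table.

```python
import math
from typing import Sequence, Tuple

# (MHPrS, CBTZ, CBT) average misclassification costs, copied from Table 2.
TABLE2 = {
    "abalone": (1.691, 1.798, 1.785), "colic": (0.531, 0.616, 0.626),
    "credit-a": (0.317, 0.337, 0.423), "credit-g": (0.610, 0.608, 0.618),
    "diabetes": (0.542, 0.554, 0.572), "heart": (1.947, 2.171, 2.198),
    "hypothyroid": (0.024, 0.021, 0.024), "led": (1.520, 1.654, 1.879),
    "mushroom": (0.020, 0.021, 0.016), "sick": (0.052, 0.031, 0.026),
    "sonar": (0.386, 0.529, 0.491), "soybean": (0.225, 0.350, 0.300),
    "splice": (0.138, 0.198, 0.228), "tumor": (2.468, 2.698, 2.987),
    "vowel": (0.975, 0.766, 0.484), "waveform": (0.972, 1.068, 1.072),
}


def win_lose_tie(a: Sequence[float], b: Sequence[float]) -> Tuple[int, int, int]:
    """Count data sets where a has lower cost than b, higher cost, or the same."""
    wins = sum(x < y for x, y in zip(a, b))
    losses = sum(x > y for x, y in zip(a, b))
    return wins, losses, len(a) - wins - losses


def geometric_mean_ratio(a: Sequence[float], b: Sequence[float]) -> float:
    """Geometric mean of the per-data-set cost ratios a/b (below 1 favours a)."""
    return math.exp(sum(math.log(x / y) for x, y in zip(a, b)) / len(a))


mhprs = [row[0] for row in TABLE2.values()]
cbtz = [row[1] for row in TABLE2.values()]
cbt = [row[2] for row in TABLE2.values()]
versus_cbtz = win_lose_tie(mhprs, cbtz)
versus_cbt = win_lose_tie(mhprs, cbt)
ratio_cbtz = geometric_mean_ratio(mhprs, cbtz)
ratio_cbt = geometric_mean_ratio(mhprs, cbt)
```

The geometric mean is used rather than the arithmetic mean so that a halving of cost on one data set and a doubling on another cancel out, regardless of the very different cost scales across data sets.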
Table 2. Full cost results for some methods 

Data set     MHMatT  MHMatS  MHPrT  MHPrS  LgPrT  LgPrS  CBTZ   CBT 
abalone      2.393   1.880   2.060  1.691  1.846  1.705  1.798  1.785 
colic        0.555   0.531   0.598  0.531  0.573  0.570  0.616  0.626 
credit-a     0.389   0.317   0.379  0.317  0.444  0.317  0.337  0.423 
credit-g     0.708   0.606   0.729  0.610  0.740  0.598  0.608  0.618 
diabetes     0.668   0.537   0.747  0.542  0.684  0.497  0.554  0.572 
heart        2.501   2.088   2.281  1.947  2.295  1.845  2.171  2.198 
hypothyroid  0.024   0.024   0.025  0.024  0.024  0.046  0.021  0.024 
led          2.030   1.590   1.874  1.520  1.800  1.431  1.654  1.879 
mushroom     0.013   0.018   0.015  0.020  0.041  0.022  0.021  0.016 
sick         0.030   0.055   0.030  0.052  0.030  0.046  0.031  0.026 
sonar        0.366   0.383   0.346  0.386  0.377  0.322  0.529  0.491 
soybean      0.400   0.273   0.397  0.225  0.361  0.414  0.350  0.300 
splice       0.163   0.148   0.161  0.138  0.161  0.162  0.198  0.228 
tumor        3.201   2.929   2.681  2.468  2.647  2.438  2.698  2.987 
vowel        0.427   0.966   0.423  0.975  0.186  2.111  0.766  0.484 
waveform     1.168   1.029   1.160  0.972  1.041  0.799  1.068  1.072 

Thus the appropriate gradient descent based method with stumps is compet- 
itive with the previous tree based approaches that do not fit the gradient descent 
framework. A check on accuracy performance suggests that this is not simply a 
matter of stumps being better than trees in all ways for the data sets used, as 
AdaBoost.MH’s overall accuracy performance is better with trees than stumps, 
so the advantage of stumps lies in their suitability for the method of making 
probabilistic predictions, while not being superior in simple accuracy terms. (A 
check on the number of classifiers chosen by the internal validations suggests 
that the stump based approaches might benefit from more rounds in some cases, 
especially on the vowel data set where the average number of classifiers chosen 
is close to the maximum possible - some further experiments will be conducted 
on this.) 

4 Conclusions and Further Work 

This paper has examined the use in the misclassification cost context of some 
gradient descent leveraging methods using decision trees and stumps as the un- 
derlying base classifiers, experimentally comparing on 16 data sets the cost per- 
formance of the original methods and some modifications of them, and previous 
cost boosting approaches. The results show that the use of multiple weight per 
item methods enables the use of a more matrix style weighting method that 
performs better than previous weighting methods that collapsed the matrix to 
a vector. The results show that the use of a multi-class logistic probabilistic 
prediction from the leveraging methods performs better than the simple original 
prediction methods intended for the accuracy context. When compared against 
previously proposed cost boosting methods using trees, the performance of one 
of the new stump based methods is competitive overall in cost terms. Accu- 
racy performance results of the same method with trees suggest that the use of 
stumps may be particularly suited to the probabilistic prediction approach, as 
the trees perform better in accuracy terms. 

Given the gap in performance between the tree based gradient descent ap- 
proaches and the previous cost boosting approaches, an interesting possible di- 
rection to pursue seems to be in the creation of potential functions that better 
reflect the cost task in some way, and we are looking at some possibilities of 
this form at present. Given our previous results on stacking different forms of 
classifier [2], another fairly straightforward issue to investigate is whether the 
successful boosted stumps method constitutes a useful base classifier for the 
stacking approach. 

Although this paper has put forward some successful cost based modifications 
to previous gradient descent leveraging methods, and experimentally demon- 
strated their potential, there are still some interesting possibilities to pursue in 
this area. 
Abstract. We present a framework for backward and forward temporal 
projection that combines dynamic and temporal logic. A propositional 
dynamic logic for reasoning about actions is extended with temporal 
modalities; the syntax of this extension differs from the syntax of the 
converse of programs, previously understood as backwards modalities. 
An application is carried out to benchmark postdiction examples such 
as the Stanford Murder Mystery [1] and the Two-Buses Problem [8]. A 
method for automatically generating frame axioms is used; the axioms 
so generated are treated as supplementary axioms in the associated 
proof theory. In future work, we hope to embed this system into a more 
comprehensive logic for reasoning about actions that enables a unified 
treatment of the frame, qualification and ramification problems and to 
work with more ‘scaled-up’ examples. 

Keywords: temporal reasoning, commonsense reasoning 



1 Introduction 

In devising logical frameworks for temporal projection, our intuition is that when 
we inspect the current state of the world, we can reason about what states of 
affairs would obtain on the performance of a particular action and about what 
actions could have realized the current state (the latter is the classical form 
of the explanation problem). A logical framework, then, for temporal reasoning 
should be truly general. It should take the entire time line into consideration (not 
just a single step in the time line) and facilitate inferences about various target 
points in the time line, from partial information about other source points. When 
the target points are ahead of the source points in the time line, the inference 
problem is that of prediction; when it is the other way around, the problem is 
that of postdiction. These are vital components of commonsense reasoning: the 
latter requires reasoning from effects to causes and poses a special challenge for 
logical formalisms. Shanahan [17] provides a good discussion of the problems 
created by benchmark examples such as the Stanford Murder Mystery [1] and 
the Two-Buses problem [8]. 

Dynamic logic as a formalism for reasoning about action was proposed twenty 
years ago [7,16,9]. However, classical dynamic logic as originally proposed [14] 
can only directly express reasoning forward in time. For instance, [a]A means 
A is true after performing action a. If we view [a] as a temporal modality, 
some properties of [a] are identical to those of the future modality [F]. More 
precisely, [a] acts like a “nearest future”: a combination of the ‘Future’ [F] and 
‘Next’ $\bigcirc$ modalities of temporal concurrent logic [13]. To reason backwards in 
dynamic logic, one can extend classical dynamic logic with inverse operations as 
in [11] (subsequently, Converse PDL): $a^{-1}$ converts an action into its inverse. 
So to retrodict the state before an action is performed, we would perform the 
inverse action. Alternatively, we suggest, an accompanying modality to [a] can 
be introduced which, paralleling the role of [P] in temporal logic, specifies the 
previous state before the performance of a. Syntactically these two approaches 
are different because Converse PDL extends the language by expanding the set 
of programs while our suggested alternative adds a new modality. They are also 
different in both methodology and ontology. The second approach allows us to 
reason about the past directly rather than indirectly via reversing changes to 
recover previous world states. More significantly, not all actions are coherently 
understood as being reversible, whereas previous worlds always exist and so we 
need, and should be able to, reason about them. 

In this study, we supplement the classical dynamic logic PDL with a temporal 
modality: a construct that allows us to speak coherently of fluents being true 
in previous, or temporally prior, situations or states. In normal temporal logics, 
the modal operator [F] is used to express the necessary truth of propositions in 
the future. So, [F]A means that the proposition A must be true at all future 
times. The dual operator $\langle F \rangle$ expresses the possible truth of propositions 
in the future, that is, $\langle F \rangle A$ means A is true at some time in the future. 
Similarly, [P] and $\langle P \rangle$ express the necessary and possible truth of 
propositions in the past. The relationship between these modalities is succinctly 
expressed by the following axiom schemes (these are precisely the axiom schemes used 
in PDL with program converses [11] with F, P replaced respectively by $a$, $a^{-1}$): 

- $C_P$: $A \rightarrow [P]\langle F \rangle A$ 

- $C_F$: $A \rightarrow [F]\langle P \rangle A$ 

It is easy to see that the modal operators [F] and [P] are each the converse 
of the other. So, we could write [P] as $[F]^{-1}$ in order to make the relationship 
between [F] and [P] more visible. These temporal modalities specify the state of 
a system in the entire future or entire past. A more refined temporal operator 
is $\bigcirc$ (Next) (considering time to be made up of discrete points). $\bigcirc A$ would 
then mean “A is true in the next step”; $\bigcirc^{-1} A$ would mean “A is true in the 
previous step”. However, these modalities are purely temporal. They do not say 
why a proposition is true. So, to express the effect of an action, we can use the 
language of dynamic logic as follows: 

- [a]A : A must be true after performing action a. 

Note that [a] A not only expresses the fact that A becomes true as the effect of 
action a but it also expresses the temporal relation between the current instant 
in time and the time after performing a. A similar temporal concept could be 
introduced as follows: 

- $[a]^{-1}A$ : A must have been true before performing action a. 

The dual modalities can then be defined: 

- $\langle a \rangle A$ : a is executable and A could be true after performing a. 

- $\langle a \rangle^{-1}A$ : a is executable and A could have been true before performing a. 

For example, $[\textit{turn-off}]\neg\textit{light}$ means that the action turn-off causes the 
light to be turned off. It also says that the light will be off after performing 
the turning off action. Similarly, $[\textit{turn-off}]^{-1}\textit{light}$ means “the light was on in 
all worlds before the turn-off action was taken” and $\langle\textit{turn-off}\rangle^{-1}\textit{light}$ 
means “the light was on in some worlds before the turn-off action was 
taken”. The connection here between forward and backwards modalities and the 
postconditions and preconditions of actions should be apparent and intuitive. 

We now present an intuitive semantics, supply an accompanying axiomatic 
deductive system and apply the framework to standard examples. 



2 The Logic TPDL 

In dynamic logic a modal connective is associated with each command $\alpha$ of a 
programming language, with the formula $[\alpha]A$ being read “after $\alpha$ terminates, 
A must be true.” The dual operator $\langle\alpha\rangle$ of $[\alpha]$ is an abbreviation of 
$\neg[\alpha]\neg$; $\langle\alpha\rangle A$ means “there is an execution of $\alpha$ that terminates 
with A true”. With the compound operators $;$, $\cup$, $*$, complex programs can be generated 
from atomic programs. If $\alpha$, $\beta$ are programs, $\alpha; \beta$ means “execute $\alpha$ 
followed by $\beta$”, $\alpha \cup \beta$ means “execute either $\alpha$ or $\beta$ 
nondeterministically”, and $\alpha^*$ means “repeat $\alpha$ a finite number of times 
nondeterministically”. $A?$ is a special program, meaning “test A: proceed 
if A is true, else fail” [7] [5]. 

2.1 The Language 

The alphabet of the language $\mathcal{L}_{TPDL}$ consists of countable sets $Flu$, $Act_P$ 
of fluent and primitive action symbols respectively. Formulas ($A \in Fma$) and actions 
($\alpha \in Act$) are defined by the following BNF rules: 

- $A ::= f \mid \neg A \mid A_1 \rightarrow A_2 \mid [\alpha]A \mid [\alpha]^{-1}A$ 

- $\alpha ::= a \mid \alpha_1; \alpha_2 \mid \alpha_1 \cup \alpha_2 \mid \alpha^* \mid A?$ where $f \in Flu$ and $a \in Act_P$ 

$\langle\alpha\rangle\top$ means “$\alpha$ is executable”. The definitions of 
$\top, \bot, \vee, \wedge, \leftrightarrow$ are as usual. A 
literal is a fluent or its negation. The set of all literals in $\mathcal{L}_{TPDL}$ is denoted $Flu_L$. 
A formula without modalities, termed a proposition, is denoted by $\varphi, \varphi_1, \varphi_2$ 
and $\psi$. The dual modal operators of $[\alpha]A$ and $[\alpha]^{-1}A$ are defined as usual: 
$\langle\alpha\rangle A =_{def} \neg[\alpha]\neg A$ and $\langle\alpha\rangle^{-1}A =_{def} \neg[\alpha]^{-1}\neg A$. 
There are subtle differences between TPDL and Converse PDL. $[\alpha]^{-1}A$ is not 
equivalent to $[\alpha^{-1}]A$. Converse PDL uses inverse actions of the form $\alpha^{-1}$. TPDL 
uses only forward actions, and has the backward temporal operator $[\alpha]^{-1}$ 
. TPDL is less expressive than Converse PDL. For instance, = 

is expressible in Converse PDL but not in TPDL. We introduce 
the following notation: 

- $\langle[\alpha]\rangle A =_{def} \langle\alpha\rangle\top \wedge [\alpha]A$, read as “$\alpha$ is executable and A will be true after 
performing $\alpha$”. 



2.2 Semantics 



The semantics of TPDL is a standard Kripkean semantics for PDL (see [5] for 
standard introductions) plus an added accessibility relation. A model is a triple 
$M = (W, \{R^F_\alpha : \alpha \in Act\} \cup \{R^P_\alpha : \alpha \in Act\}, V)$, where 
$R^F_\alpha$ and $R^P_\alpha$ are binary relations on $W$, and $V$ is a function from $Flu$ 
to $2^W$ (or equivalently, a two-place relation on $Flu \times W$). Intuitively, performing 
an action takes the agent from one world to another (a state transformation). For 
instance, $(w_1, w_2) \in R^F_\alpha$ means that if the current world is $w_1$, then after 
performing action $\alpha$ the world will evolve to state $w_2$. $(w_1, w_2) \in R^P_\alpha$ 
means that if the current state is $w_1$, then before performing the action $\alpha$ the 
world was $w_2$. The satisfaction relation $M \models_w A$ is defined as follows: 
$M \models_w f$ iff $w \in V(f)$ for any $f \in Flu$. We then have the following: 

- $M \models_w [\alpha]A$ iff $\forall w' \in W$, $(w R^F_\alpha w' \Rightarrow M \models_{w'} A)$. 

- $M \models_w [\alpha]^{-1}A$ iff $\forall w' \in W$, $(w R^P_\alpha w' \Rightarrow M \models_{w'} A)$. 

As usual, $A \in Fma$ is valid in a model $M$, written $M \models A$, if $M \models_w A$ for all $w \in W$. 
$\models A$ means A is valid in all models. Standard models are those in which the 
binary relation $R^F_\alpha$ has the intended meanings of programs or actions $\alpha$, with 
the added condition that $R^P_\alpha = (R^F_\alpha)^{-1}$ (as in [11]). The following hold in any 
standard model: 



- $R_{\alpha;\beta} = R_\alpha \circ R_\beta$ (Composition)

- $R_{\alpha\cup\beta} = R_\alpha \cup R_\beta$ (Alternation)

- $R_{\alpha^*} = R_\alpha^*$ (Iteration)

- $R_{A?} = \{(w,w) : M \models_w A\}$ (Test)
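The semantic clauses above can be sketched as a small model checker. This is only an illustrative Python sketch under stated assumptions: the tuple encoding of programs and formulas and the names `Model`, `relation` and `holds` are hypothetical, and the backward modality is evaluated over $R_\alpha$-predecessors, i.e. the model is assumed to be standard.

```python
# Programs: ("act", name) | ("seq", p, q) | ("alt", p, q) | ("star", p) | ("test", A)
# Formulas: ("top",) | ("flu", name) | ("not", A) | ("imp", A, B)
#           | ("box", p, A) for [p]A | ("pbox", p, A) for [p]^{-1}A
# This encoding is an assumption made for the example.

class Model:
    def __init__(self, worlds, rel, val):
        self.worlds = set(worlds)  # W
        self.rel = rel             # primitive action -> set of (w, w') pairs
        self.val = val             # fluent -> set of worlds where it holds

    def relation(self, prog):
        """Accessibility relation of a (possibly compound) program."""
        kind = prog[0]
        if kind == "act":
            return set(self.rel.get(prog[1], set()))
        if kind == "seq":   # composition
            r, s = self.relation(prog[1]), self.relation(prog[2])
            return {(a, d) for (a, b) in r for (c, d) in s if b == c}
        if kind == "alt":   # alternation
            return self.relation(prog[1]) | self.relation(prog[2])
        if kind == "star":  # iteration: reflexive-transitive closure
            step = self.relation(prog[1])
            closure = {(w, w) for w in self.worlds}
            while True:
                new = closure | {(a, d) for (a, b) in closure
                                 for (c, d) in step if b == c}
                if new == closure:
                    return closure
                closure = new
        if kind == "test":
            return {(w, w) for w in self.worlds if self.holds(w, prog[1])}
        raise ValueError(kind)

    def holds(self, w, f):
        """Satisfaction of formula f at world w."""
        kind = f[0]
        if kind == "top":
            return True
        if kind == "flu":
            return w in self.val.get(f[1], set())
        if kind == "not":
            return not self.holds(w, f[1])
        if kind == "imp":
            return (not self.holds(w, f[1])) or self.holds(w, f[2])
        if kind == "box":   # all successors satisfy the body
            return all(self.holds(v, f[2])
                       for (u, v) in self.relation(f[1]) if u == w)
        if kind == "pbox":  # all predecessors satisfy the body
            return all(self.holds(u, f[2])
                       for (u, v) in self.relation(f[1]) if v == w)
        raise ValueError(kind)


# Toy model: world 0 --a--> 1 --b--> 2, fluent p true only at world 0.
m = Model({0, 1, 2}, {"a": {(0, 1)}, "b": {(1, 2)}}, {"p": {0}})
a, b, p = ("act", "a"), ("act", "b"), ("flu", "p")
print(m.holds(2, ("pbox", ("seq", a, b), p)))  # p held before performing a;b
print(m.holds(0, ("box", a, p)))               # p does not hold after a
```

Recomputing relations on every call keeps the sketch short; a practical checker would cache them.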



2.3 A Deductive System for TPDL 

The deductive system for TPDL consists of the following axiom schemes and
inference rules (where $A, B \in Fma$, $\alpha \in Act$):

1. Axiom Schemes

- All tautologies of propositional dynamic logic.

- Axiom schemata analogous to those for normal temporal logics:

• Cp: $A \rightarrow [\alpha]^{-1}\langle\alpha\rangle A$

• Cf: $A \rightarrow [\alpha]\langle\alpha\rangle^{-1}A$

• FK: $[\alpha](A \rightarrow B) \rightarrow ([\alpha]A \rightarrow [\alpha]B)$

• PK: $[\alpha]^{-1}(A \rightarrow B) \rightarrow ([\alpha]^{-1}A \rightarrow [\alpha]^{-1}B)$

2. Inference Rules

FN: $\dfrac{A}{[\alpha]A}$

PN: $\dfrac{A}{[\alpha]^{-1}A}$

MP: $\dfrac{A,\ A \rightarrow B}{B}$



Provability in TPDL is denoted $\vdash$. We note the presence of the new axioms
FK, PK and rules FN, PN, which extend the K axiom and N rule of standard PDL.
The following lemma is an obvious consequence of our definitions:



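Lemma 1.

1. $\vdash \langle\alpha\rangle^{-1}[\alpha]A \rightarrow A$

2. $\vdash \langle\alpha\rangle[\alpha]^{-1}A \rightarrow A$

3. $\vdash \langle\alpha\rangle^{-1}\top \wedge [\alpha]^{-1}A \rightarrow \langle\alpha\rangle^{-1}A$

4. $\vdash [\alpha]^{-1}(A \rightarrow B) \rightarrow (\langle\alpha\rangle^{-1}A \rightarrow \langle\alpha\rangle^{-1}B)$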



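Theorem 1. The following are theorems of TPDL:

- $[\alpha;\beta]^{-1}A \leftrightarrow [\beta]^{-1}[\alpha]^{-1}A$,

- $[\alpha\cup\beta]^{-1}A \leftrightarrow [\alpha]^{-1}A \wedge [\beta]^{-1}A$,

- $[\alpha^*]^{-1}A \leftrightarrow A \wedge [\alpha^*]^{-1}[\alpha]^{-1}A$.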
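As a sanity check on the composition case, the following sketch unfolds the semantic clauses in a standard model, where the backward box quantifies over $R_\alpha$-predecessors. It reads the connective as an equivalence, which is an assumption, and the world names $u, v$ are introduced only for this illustration.

```latex
\begin{align*}
M \models_w [\alpha;\beta]^{-1}A
  &\iff \forall u\,\bigl(u\,R_{\alpha;\beta}\,w \Rightarrow M \models_u A\bigr) \\
  &\iff \forall u\,\forall v\,\bigl(u\,R_\alpha\,v \wedge v\,R_\beta\,w \Rightarrow M \models_u A\bigr) \\
  &\iff \forall v\,\bigl(v\,R_\beta\,w \Rightarrow \forall u\,(u\,R_\alpha\,v \Rightarrow M \models_u A)\bigr) \\
  &\iff \forall v\,\bigl(v\,R_\beta\,w \Rightarrow M \models_v [\alpha]^{-1}A\bigr) \\
  &\iff M \models_w [\beta]^{-1}[\alpha]^{-1}A .
\end{align*}
```

The order of the two backward boxes is reversed because looking back over $\alpha;\beta$ first undoes $\beta$ and then $\alpha$.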

Soundness and completeness of the deductive system above are easily shown:

Theorem 2. (Soundness and completeness of TPDL) $A$ is valid in all standard
models of $\mathcal{L}_{TPDL}$ if and only if $\vdash A$.

Decidability in TPDL has a time complexity similar to that of PDL:

Proposition 1. Validity in TPDL is decidable in exponential time.

3 Reasoning with Action Descriptions in TPDL 

Logics of action typically aim to provide action descriptions for describing the
effects of actions and the causal relationships in a system, and inference schemes
that enable the prediction of the effects of actions or explanations of observed
phenomena. We describe these components in turn within the context of our
framework. An action description for a dynamic system is a finite set of formulas,
which specifies actions in the system and includes information about
the direct and indirect effects of actions in the system. A formula in an action
description is significantly different from an ordinary formula. The sentence
$loaded \rightarrow [Shoot]\neg alive$ means that whenever $loaded$ is true (i.e., the
fluent $loaded$ holds), performing the action $Shoot$ will result in a situation or
state in which the fluent $alive$ is false, i.e., $\neg alive$ holds. In the situation calculus
a similar statement would read $\forall s\,(loaded(s) \rightarrow \neg alive(do(Shoot, s)))$.
To express the same statement in dynamic logic, an extra modality $any$ is
used ([15,2]) to denote “any action”. This lets us write expressions such as
$[any](loaded \rightarrow [Shoot]\neg alive)$; such a modality is an S5 modality. Introducing
the $any$ modality formally would involve adding all S5 axioms and an extra axiom
($[any]A \rightarrow [\alpha]A$). However, rather than introducing extra modalities which
would make the system cumbersome, we use the techniques of [19], in which
an action description is treated as a set of axioms. These function like domain
axioms in the situation calculus and let us define the notion of $\Sigma$-provability.

Definition 1. Let $\Sigma$ be a finite set of formulas. A formula $A$ is a $\Sigma$-theorem,
denoted $\vdash^{\Sigma} A$, if it belongs to the smallest set of formulas which contains all
theorems of TPDL and all elements of $\Sigma$, and is closed under MP, FN and PN.

For any $\Gamma \subseteq Fma$, a sentence $A$ is $\Sigma$-provable from $\Gamma$, written $\Gamma \vdash^{\Sigma} A$, if
$\exists A_1, \ldots, A_n \in \Gamma$ such that $\vdash^{\Sigma} A_1 \rightarrow (\ldots(A_n \rightarrow A)\ldots)$. $\Gamma$ is $\Sigma$-consistent if
$\Gamma \nvdash^{\Sigma} \bot$.

Definition 2. Let $\Sigma$ be an action description. A standard model $M$ is a $\Sigma$-model
if $M \models A$ for any $A \in \Sigma$. $A$ is $\Sigma$-valid, i.e., $\models^{\Sigma} A$, if it is valid in every
$\Sigma$-model.

Soundness and completeness are then easily shown:

Theorem 3. $\Sigma$-provability is sound and complete: $\vdash^{\Sigma} A$ iff $\models^{\Sigma} A$.

3.1 Frame Axioms

A reasonable requirement of any framework for reasoning about action is that
it provide a solution to the frame problem. The following simple solution is
proposed in [3] ($L$ is a literal below).

Definition 3. A formula of the form $\varphi \wedge L \rightarrow [\alpha]L$ is a frame axiom.

Definition 4. Let $\Sigma$ be uniformly consistent (i.e., $\nvdash^{\Sigma} \bot$) and $\Gamma$ be $\Sigma$-consistent
(i.e., $\Gamma \nvdash^{\Sigma} \bot$). If, for an arbitrary formula $A$ and a set $\Delta$ of frame axioms,
$\Gamma$ is $\Sigma \cup \Delta$-consistent and $\Gamma \vdash^{\Sigma \cup \Delta} A$, then $A$ is $\Sigma$-provable from $\Gamma$ with $\Delta$,
denoted by $\Gamma \vdash^{\Sigma}_{\Delta} A$. The elements of $\Delta$ are termed supplementary axioms.

In [4], it is shown that $\Gamma \vdash^{\Sigma}_{\Delta} A$ if and only if there is a set $\Delta'$ of frame axioms
which only contain the symbols occurring in $\Gamma$, $\Sigma$ and $A$ such that $\Gamma \vdash^{\Sigma}_{\Delta'} A$
(i.e., $\Gamma \vdash^{\Sigma}_{\Delta} A$ is reduced to $\Gamma \vdash^{\Sigma}_{\Delta'} A$) (some extra conditions are required to
make this true, see [4]). Moreover, it is shown that if $\Sigma$ is in normal form†, $\Delta'$
can only contain the symbols in $\Gamma$ and $A$. This means that for prediction and
postdiction, there is no frame problem (representationally) if we postpone listing
frame axioms till a query is made.

† The following kinds of formulas are said to be in normal form:

- $[\psi]L$ (causal law)

3.2 Prediction and Postdiction



With $\Sigma$-provability, simple and desirable inferences involving prediction and
postdiction become possible.



Example 1. Consider the Yale Shooting Problem [6]. Let $Flu = \{alive, loaded\}$
and $Act_P = \{Load, Shoot, Wait\}$. This problem is described by the following
action description:



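$$\Sigma = \left\{ \begin{array}{l}
\neg loaded \rightarrow [Load]loaded \\
loaded \rightarrow [Shoot]\neg alive \\
loaded \rightarrow [Shoot]\neg loaded \\
\neg loaded \rightarrow \langle Load\rangle\top \\
\langle Wait\rangle\top \\
loaded \leftrightarrow \langle Shoot\rangle\top
\end{array} \right\}$$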



The first three sentences state the effects of the actions Load and Shoot. The last
three state the feasibility of Load, Wait and Shoot. Among them, $\neg loaded \rightarrow
\langle Load\rangle\top$ says that Load is performable if the gun is not already loaded. The
action Shoot can cause $\neg alive$ only if the gun is loaded.



We can prove that $\{\neg loaded\} \vdash^{\Sigma}_{\Delta} [Load; Wait; Shoot]\neg alive$ (i.e., that the
sequence of actions Load, Wait, Shoot results in $\neg alive$ as desired), where
$\Delta = \{loaded \rightarrow [Wait]loaded\}$, as follows:

1*. $\neg loaded \rightarrow [Load]loaded$ (AD)

2*. $loaded \rightarrow [Wait]loaded$ (FA)

3*. $loaded \rightarrow [Shoot]\neg alive$ (AD)

4*. $[Load]loaded \rightarrow [Load; Wait]loaded$ (2,FN)

5*. $[Load; Wait]loaded \rightarrow [Load; Wait; Shoot]\neg alive$ (3,FN)

6. $[Load; Wait; Shoot]\neg alive$ ($\Gamma$, 1,4,5)



where AD indicates “Action Description in $\Sigma$”; FN is an inference rule of
TPDL; $\Gamma$ represents the premises. An asterisk means the formula is a $\Sigma \cup \Delta$-theorem, so
we can use the inference rules FN and PN. FA stands for “frame axiom”. We
can also prove that $\vdash^{\Sigma} [Shoot]^{-1}loaded$ (i.e., that preconditions for actions can
be inferred using past modalities).




1*. $[Shoot]^{-1}\langle Shoot\rangle\top \rightarrow [Shoot]^{-1}loaded$ (PN)

2*. $[Shoot]^{-1}\langle Shoot\rangle\top$ (Cp)

3*. $[Shoot]^{-1}loaded$ (1,2)
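The two inferences above can be illustrated on one concrete transition system. The Python sketch below is not a proof procedure: it checks a single hand-built model, and the state encoding, the function names and the reading of the qualification laws (Load is executable only when the gun is unloaded, Shoot only when it is loaded) are assumptions made for the illustration. Wait is assumed to change nothing, which builds in the frame axiom for Wait.

```python
from itertools import product

# A state is a pair (alive, loaded).
STATES = list(product([True, False], repeat=2))

def step(action, state):
    """Successor states of `state` under `action` (empty if not executable)."""
    alive, loaded = state
    if action == "Load":
        return [(alive, True)] if not loaded else []
    if action == "Wait":
        return [state]
    if action == "Shoot":
        return [(False, False)] if loaded else []
    raise ValueError(action)

def run(actions, state):
    """All states reachable by performing the action sequence from `state`."""
    frontier = [state]
    for action in actions:
        frontier = [t for s in frontier for t in step(action, s)]
    return frontier

def box(actions, prop, state):
    """[a1; ...; an]prop: prop holds in every state the sequence can reach."""
    return all(prop(t) for t in run(actions, state))

def past_box(action, prop, state):
    """[a]^{-1}prop: prop held in every state from which `action` leads here."""
    return all(prop(s) for s in STATES if state in step(action, s))

def not_alive(state):
    return not state[0]

def loaded(state):
    return state[1]

# Prediction: from any state with an unloaded gun, Load; Wait; Shoot kills.
prediction = all(box(["Load", "Wait", "Shoot"], not_alive, s)
                 for s in STATES if not s[1])
# Postdiction: whatever state preceded a Shoot, the gun was loaded in it.
postdiction = all(past_box("Shoot", loaded, s) for s in STATES)
print(prediction, postdiction)  # prints: True True
```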


- $\varphi \rightarrow [\alpha]L$ (deterministic action law)

- $\varphi \rightarrow \langle\alpha\rangle L$ (non-deterministic action law)

- $\varphi \rightarrow \langle\alpha\rangle\top$ (qualification law).

where $\varphi$ and $\psi$ are propositional formulas, $L$ is a literal and $\alpha$ is a primitive action.
An action description $\Sigma$ is normal if each formula in $\Sigma$ is in normal form.

4 Application to Benchmark Examples

4.1 Stanford Murder Mystery 

As demonstrated, the Yale Shooting Problem (as a prediction problem) is trivially
handled. A far more interesting case is the Stanford Murder Mystery, which,
as pointed out in [1], created problems for classical non-monotonic approaches
(specifically, circumscription [10] and chronological minimization models [18])
and demonstrated the role of backward temporal projection. The status of the
gun is unknown in the initial situation while it is known that the victim is initially
alive and then is shot dead after a Shoot action followed by a Wait action.

Example 2. (Stanford Murder Mystery) Let $Flu = \{Alive, Loaded, Walking\}$
and $Act_P = \{Load, Shoot, Wait\}$. The action description is the one in Example 1.
We want to be able to derive $\{[wait]^{-1}[shoot]^{-1}alive, \neg alive\} \vdash^{\Sigma}_{\Delta}
[wait]^{-1}[shoot]^{-1}loaded$. That is, we would like to draw the inference that the
gun was loaded in the initial situation followed by the indicated sequence of
actions above leading to the unfortunate victim’s death.

As pointed out in [1], standard circumscription policy is unable to draw this
conclusion: it cannot sanction the inference that the gun was initially loaded†.
The chronological minimization model fails as well: in delaying abnormalities to
fluents and postponing the victim’s death to the waiting part of the scenario, it
ends up concluding that the gun must have been unloaded. In contrast, in our
system, we are able to formally prove the desired conclusion.

Proof for $\{[wait]^{-1}[shoot]^{-1}alive, \neg alive\} \vdash^{\Sigma}_{\Delta} [wait]^{-1}[shoot]^{-1}loaded$ (in the
following, the letter L stands for Lemma):



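1*. $alive \rightarrow [wait]alive$ (FA)

2*. $[wait]^{-1}(alive \rightarrow [wait]alive)$ (1,PN)

3*. $\langle wait\rangle^{-1}alive \rightarrow \langle wait\rangle^{-1}[wait]alive$ (2,L 1:4)

4*. $\langle wait\rangle^{-1}alive \rightarrow alive$ (3,L 1:1)

5. $\neg\langle wait\rangle^{-1}alive$ ($\Gamma$, 4)

6*. $alive \wedge \neg loaded \rightarrow [shoot]alive$ (FA)

7*. $[shoot]^{-1}(alive \wedge \neg loaded \rightarrow [shoot]alive)$ (6,PN)

8*. $\langle shoot\rangle^{-1}(alive \wedge \neg loaded) \rightarrow \langle shoot\rangle^{-1}[shoot]alive$ (7,L 1:4)

9*. $\langle shoot\rangle^{-1}(alive \wedge \neg loaded) \rightarrow alive$ (8,L 1:1)

10*. $[wait]^{-1}(\langle shoot\rangle^{-1}(alive \wedge \neg loaded) \rightarrow alive)$ (9,PN)

11*. $\langle wait\rangle^{-1}\langle shoot\rangle^{-1}(alive \wedge \neg loaded) \rightarrow \langle wait\rangle^{-1}alive$ (10,L 1:4)

12. $\Gamma \vdash^{\Sigma}_{\Delta} \neg\langle wait\rangle^{-1}\langle shoot\rangle^{-1}(alive \wedge \neg loaded)$ (5, 11)

13. $[wait]^{-1}[shoot]^{-1}\neg(alive \wedge \neg loaded)$ (12)

14. $\Gamma \vdash^{\Sigma}_{\Delta} [wait]^{-1}[shoot]^{-1}loaded$ ($\Gamma$,13,PK)

where $\Delta = \{alive \rightarrow [wait]alive,\ alive \wedge \neg loaded \rightarrow [shoot]alive\}$.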



Our framework possesses an interesting feature vis-a-vis the existence of
frame axioms. In addition to the standard frame axioms of the form “taking
action $\alpha$ has no effect on fluent $A$” we can generate axioms of the form “taking
action $\alpha$ had no effect on fluent $A$ in the past”.

† [1] then goes on to provide a circumscriptive formalization of the Stanford Murder
Mystery.

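Proposition 2.

- $(\varphi \wedge A \rightarrow [\alpha]A) \Rightarrow (\varphi \wedge \neg A \rightarrow [\alpha]^{-1}\neg A)$

- $(\varphi \wedge \neg A \rightarrow [\alpha]\neg A) \Rightarrow (\varphi \wedge A \rightarrow [\alpha]^{-1}A)$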

Proof: Immediate from definitions and axiom schemes. □ 

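The proposition above says that whenever we have a frame axiom of the form
$\varphi \wedge A \rightarrow [\alpha]A$, there exists one of the form $\varphi \wedge \neg A \rightarrow [\alpha]^{-1}\neg A$; for every frame
axiom of the form $\varphi \wedge \neg A \rightarrow [\alpha]\neg A$ there exists one of the form $\varphi \wedge A \rightarrow [\alpha]^{-1}A$.

These axioms can be used for shorter proofs of postdictive inferences.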



4.2 The Two Bus Problem 



We now demonstrate the efficacy of our approach by considering an example
[8] that proves problematic for state-minimization [1] and narrative-based
approaches [12]. In the example below, the minimal models obtained by the
circumscription policy of the state minimization models are adversely affected
by the observation axioms (since non-deterministic actions are involved).



Example 3. There are two buses that take the agent to its workplace. The
buses are of different colors: one red, the other yellow. To get to work, the
commuter takes the first bus that comes by. It can board either bus only if it
has a ticket. We use the fluents HasTicket, OnRedBus and OnYellowBus to
express the state of the commuter. We consider two actions: BuyTicket and
GetOnBoard. This scenario is specified by the following action description:
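$$\Sigma = \left\{ \begin{array}{l}
\neg(OnRedBus \wedge OnYellowBus) \\
OnRedBus \rightarrow HasTicket \\
OnYellowBus \rightarrow HasTicket \\
\neg HasTicket \rightarrow [BuyTicket]HasTicket \\
{[GetOnBoard]}(OnRedBus \vee OnYellowBus) \\
HasTicket \wedge \neg OnRedBus \wedge \neg OnYellowBus \leftrightarrow \langle GetOnBoard\rangle\top \\
\langle BuyTicket\rangle\top
\end{array} \right\}$$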



In the state minimization model, the erroneous deduction that the agent is 
on the red bus immediately after buying the ticket (without taking the action 
GetOnBoard) is not blocked. In contrast, our system allows the desired infer- 
ences and blocks the undesirable ones. 

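We can prove:

$\{\neg HasTicket, \neg OnRedBus, \neg OnYellowBus\} \vdash^{\Sigma}_{\Delta} \langle[BuyTicket; GetOnBoard]\rangle(OnRedBus \vee OnYellowBus)$

where $\Delta = \{\neg OnRedBus \rightarrow [BuyTicket]\neg OnRedBus,\ \neg OnYellowBus \rightarrow
[BuyTicket]\neg OnYellowBus\}$.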

However, we can never prove that

$\{\neg HasTicket, \neg OnRedBus, \neg OnYellowBus\} \vdash^{\Sigma}_{\Delta} \langle[BuyTicket]\rangle OnRedBus$

or

$\{\neg HasTicket, \neg OnRedBus, \neg OnYellowBus\} \vdash^{\Sigma}_{\Delta} \langle[BuyTicket; GetOnBoard]\rangle OnRedBus$

no matter what frame axioms are used in $\Delta$.

Note too, that we can prove:

$\{OnRedBus\} \vdash^{\Sigma} [GetOnBoard]^{-1}(\neg OnRedBus)$

but

$\{OnRedBus\} \nvdash^{\Sigma} [BuyTicket]^{-1}(\neg OnRedBus)$,

which means if the passenger is on the red bus currently, before it got on the
bus, it must not have been on the red bus, while buying a ticket can never cause
the commuter to be off the bus. Furthermore, given that the passenger is on a
bus (either one) we can draw the true inference that the commuter performed
the action of boarding the bus, preceded by buying a ticket, before which the
commuter was not on either of the buses:

$\{OnRedBus \vee OnYellowBus\} \vdash^{\Sigma}_{\Delta} [GetOnBoard]^{-1}[BuyTicket]^{-1}(\neg OnRedBus \wedge \neg OnYellowBus)$.

We cannot strengthen the conclusion with the added conjunct $\neg HasTicket$ since
buying a ticket is always possible even if the commuter already has a ticket. We
can also infer that in order to board the bus, the commuter must have a ticket
and must not be on either of the two buses: $\vdash^{\Sigma} [GetOnBoard]^{-1}(\neg OnRedBus \wedge
HasTicket \wedge \neg OnYellowBus)$.
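The role of non-determinism in this example can be illustrated on one concrete transition system. The Python sketch below is a model check of a single hand-built model, not a derivation, so it can only illustrate the claims above. The state encoding and function names are hypothetical; BuyTicket is assumed to leave the bus fluents unchanged, and GetOnBoard is assumed to be executable exactly when the commuter holds a ticket and is on neither bus.

```python
from itertools import product

# A state is (has_ticket, on_red, on_yellow). Keep only states allowed by the
# domain constraints: never on both buses, and on a bus only with a ticket.
STATES = [s for s in product([False, True], repeat=3)
          if not (s[1] and s[2]) and (s[0] or not (s[1] or s[2]))]

def step(action, state):
    """Successor states of `state` under `action` (empty if not executable)."""
    has_ticket, on_red, on_yellow = state
    if action == "BuyTicket":    # always executable
        return [(True, on_red, on_yellow)]
    if action == "GetOnBoard":   # non-deterministic: either bus may come first
        if has_ticket and not on_red and not on_yellow:
            return [(True, True, False), (True, False, True)]
        return []
    raise ValueError(action)

def run(actions, state):
    """All states reachable by performing the action sequence from `state`."""
    frontier = [state]
    for action in actions:
        frontier = [t for s in frontier for t in step(action, s)]
    return frontier

def exec_box(actions, prop, state):
    """<[a1;...;an]>prop: the sequence is executable and prop holds in every outcome."""
    outcomes = run(actions, state)
    return bool(outcomes) and all(prop(t) for t in outcomes)

def past_box(action, prop, state):
    """[a]^{-1}prop: prop held in every state from which `action` leads here."""
    return all(prop(s) for s in STATES if state in step(action, s))

def on_red(state):
    return state[1]

def on_some_bus(state):
    return state[1] or state[2]

def not_on_red(state):
    return not state[1]

start = (False, False, False)
plan = ["BuyTicket", "GetOnBoard"]
red = (True, True, False)
print(exec_box(plan, on_some_bus, start))        # True: ends on some bus
print(exec_box(plan, on_red, start))             # False: not necessarily the red one
print(exec_box(["BuyTicket"], on_red, start))    # False: buying a ticket boards no bus
print(past_box("GetOnBoard", not_on_red, red))   # True: was off the red bus before boarding
print(past_box("BuyTicket", not_on_red, red))    # False: a ticket may be bought while on the bus
```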



5 Conclusion 

In this paper, we have presented a framework for backward temporal projection
that combines two existing frameworks for reasoning about action: dynamic logic
and temporal logic. We have demonstrated the intuitive plausibility of the system
by its application to standard examples in the literature and demonstrated
its formal elegance and soundness. More work remains to be done, however:
embedding this system into a more comprehensive logic for reasoning about actions
will enable a unified treatment of the frame, qualification and ramification problems.
We would also like to apply the system to more ‘scaled-up’ examples. At
a purely formal level, investigating the interaction of the iteration construct of
standard PDL with the backward modalities of TPDL is an interesting avenue
of research.

A Method for Reasoning with Ontologies Represented as Conceptual Graphs

Dan Corbett

Abstract. This paper discusses automated reasoning over ontologies,
represented as Conceptual Graphs. We have designed and implemented
reasoning tools using Conceptual Graphs as the underlying knowledge
structure. This work demonstrates that the power of logic as implemented in
Conceptual Graphs, and the tools available in Conceptual Graph Theory, can be
used as powerful ontology reasoning tools in a real-world domain. We show
that ontologies can be constrained and unified using efficient methods, and that
these methods provide the basis for an automated reasoning system. The
Conceptual Graph techniques of concept join, partial order and subsumption are
all exploited to create these reasoning tools.

We discuss the implementation of our ideas, and demonstrate the reasoning
tool that we created in two domains: building architecture and defence. The
significance of our work is that the previously static knowledge representation
of ontology is now a dynamic, functional reasoning system.



1. A Brief Overview of Conceptual Graphs 

Conceptual Structures (or Conceptual Graphs, or “CGs”) are a knowledge 
representation scheme, inspired by the existential graphs of Charles Sanders Peirce 
and further extended and defined by John Sowa [19-21]. Informally, CGs can be 
thought of as a formalization and extension of Semantic Networks, although the 
origins are different. They are labeled graphs with two types of nodes: concepts 
(which represent objects, entities or ideas) and relation nodes, which represent 
relations between the concepts. As an example, Figure 1 shows a Conceptual Graph 
which represents the knowledge that “The cat Felix is sitting on the mat which is 
known as mat 47.” 

Figure 1. A Simple CG.

Every concept or relation has an associated type. A concept may also have a
specific referent or individual. A concept in a CG may represent a specific instance
of that type (e.g., Felix is a specific instance, or individual, of type cat) or we may
choose only to specify the type of the concept. That is to say that a concept may
simply represent a generic concept for a type, such as mammal or room, or a concept
may represent a specific object or idea, such as my cat or the kitchen at the Smiths’
house. In the former case, the concepts in Figure 1 would be shown as “cat: *” and
“mat: *” indicating non-specified entities of types cat and mat. In the standard
canonical formation rules for Conceptual Graphs, unbound concepts are existentially
quantified.

A relation may have zero or one incoming arcs, and one or more outgoing arcs. 
The type of the relation determines the number of arcs allowed on the relation. The 
arcs always connect a concept to a relation. Arcs cannot exist between concepts, or 
between relations. 
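The structural rules just described (two kinds of nodes, arcs only between a concept and a relation, at most one incoming arc per relation) can be sketched as a small data structure. This Python sketch is illustrative only: the class and method names are hypothetical, and the layout of the example graph, a single "on" relation linking Felix to mat #47, is an assumption about Figure 1.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)          # nodes are compared by identity, not by content
class Concept:
    type: str
    referent: str = "*"       # "*" marks a generic (existentially quantified) concept

@dataclass(eq=False)
class Relation:
    type: str

@dataclass
class ConceptualGraph:
    arcs: list = field(default_factory=list)   # (source node, target node) pairs

    def add_arc(self, source, target):
        # Arcs connect a concept to a relation, never two nodes of the same kind.
        if isinstance(source, Concept) == isinstance(target, Concept):
            raise ValueError("an arc must connect a concept and a relation")
        # A relation may have zero or one incoming arcs.
        if isinstance(target, Relation) and any(t is target for _, t in self.arcs):
            raise ValueError("a relation has at most one incoming arc")
        self.arcs.append((source, target))

# "The cat Felix is sitting on the mat which is known as mat 47."
felix = Concept("cat", "Felix")
mat = Concept("mat", "#47")
on = Relation("on")
graph = ConceptualGraph()
graph.add_arc(felix, on)   # [cat: Felix] -> (on)
graph.add_arc(on, mat)     # (on) -> [mat: #47]
```

Checking the number of outgoing arcs against the relation type is left out, since that depends on the arity each type declares.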

A canon in the sense discussed here is the set of all CGs which are well-formed,
and meaningful in their domain. Canonical formation rules specify how CGs can be
legally built and guarantee that the resulting CGs satisfy “sensibility constraints.”
The sensibility constraints are rules in the domain which specify how a CG can be
built, for example that the concept eats must have a theme which is food. Note that
canonicity does not guarantee validity. A CG may be well-formed under the canonical
formation rules for the domain, but still be false.

A type hierarchy is established for both the concepts and the relations within a 
canon. A type hierarchy is based on the intuition that some types subsume other 
types, for example, every instance of cat would also have all the properties of 
mammal. This hierarchy is expressed by a subsumption or generalization order on 
types. 

For the reader interested in a formal treatment of these ideas (which we don’t have 
room for here), Sowa discusses his original definitions in [20] but our work follows 
the further formalized and refined versions of Sowa’s original ideas presented by 
Willems [24], by Chein and Mugnier [6, 17] and by Corbett [8, 10] . 

2. Types and Inheritance 

The set of types discussed in the previous section is arranged into a type hierarchy,
ordered according to the specificity of each type. A type t is said to be more specific
than a type s if t inherits information from s. We write s > t, and say that s subsumes t
or is more general than t (or inversely, that t is subsumed by s, or is more specific than
s). We may also call s a supertype of t, or t a subtype of s. Equivalently to the above,
one can write t < s.

In early pioneering work on the unification of first-order terms, Reynolds [18]
used the natural lattice structure of first-order terms, which was a partial ordering
based on subsumption of terms [11]. Many terms (or types in our case) are not in any
subsumption relation, for example cat and dog, or wood and mammal. Unification
corresponds to finding the greatest lower bound of two terms in the lattice [13]. The
bottom of any lattice, which is represented with the symbol ⊥, is the type to which all
types can unify, and represents inconsistency. The top of the lattice, represented by
⊤, is the type to which all pairs of types can generalize, and is called the universal
type. Every type is a subtype of ⊤. Inheritance hierarchies can be seen as lattices that
admit unification and generalization [13]. The common specialization of two
Conceptual Graphs, s and t, is known as a join, and is represented as s ∨ t. The
common generalization of the two graphs is known as a meet, and is represented
using the symbols s ∧ t.
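The lattice operations on types can be sketched in a few lines. This Python example is a minimal sketch over a hypothetical hierarchy: the type names, the `SUPERTYPES` table and the function names are assumptions made for illustration, with "T" standing for the universal type and "bottom" for the inconsistent type.

```python
# Hypothetical hierarchy: each type maps to its direct supertypes.
SUPERTYPES = {
    "T": [],
    "animal": ["T"], "object": ["T"],
    "mammal": ["animal"], "reptile": ["animal"],
    "cat": ["mammal"], "dog": ["mammal"],
    "mat": ["object"],
    "bottom": ["cat", "dog", "reptile", "mat"],
}

def ancestors(t):
    """All types that subsume t, including t itself."""
    seen, stack = set(), [t]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUPERTYPES[current])
    return seen

def subsumes(s, t):
    """True if s is t or a supertype of t."""
    return s in ancestors(t)

def unify_types(s, t):
    """Greatest lower bound: the most general type subsumed by both s and t."""
    common = [u for u in SUPERTYPES if subsumes(s, u) and subsumes(t, u)]
    maximal = [u for u in common
               if not any(v != u and subsumes(v, u) for v in common)]
    assert len(maximal) == 1, "the hierarchy is not a lattice"
    return maximal[0]

def generalize_types(s, t):
    """Least upper bound: the most specific type subsuming both s and t."""
    common = ancestors(s) & ancestors(t)
    minimal = [u for u in common
               if not any(v != u and subsumes(u, v) for v in common)]
    assert len(minimal) == 1, "the hierarchy is not a lattice"
    return minimal[0]

print(unify_types("cat", "mammal"))      # cat
print(unify_types("cat", "dog"))         # bottom: the types are inconsistent
print(generalize_types("cat", "dog"))    # mammal
print(generalize_types("cat", "mat"))    # T
```

A result of "bottom" signals that the two types cannot be unified consistently.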

The process of unifying Conceptual Graphs includes the process of finding the
most general subtypes for pairs of types of concepts, which depends on the two types
in question being consistent. We also allow constraints on the concepts in the graphs,
which are processed during the unification and resolution process. Unification (by
projection) is the mechanism we use to find the solution of the constraints. In our
work, unification is a tool which performs the work of identifying two structures
using subsumption, where the elements of the structure can be constrained.

3. Unification as Reasoning 

Until very recently, CGs have had no formalism for constraining real values in the 
referent of a concept. The standard method for representing and validating 
constraints has been to use type subsumption to specify which concept types (or 
subsumed subtypes) are valid in a system. One could constrain values in a knowledge 
representation system by forcing the concepts to conform to a specified type, or else 
to be subsumed by that type. A similar method applies to relations. To extend a 
previous example, the concept eats is specified to occur only between an agent which 
is an animal and a theme which is a food. Any individual used in the animal concept
must conform to the animal type, which means that it must either be animal, or be 
subsumed by animal, such as cat or reptile. 

One generalization of unification constraints is the use of ordering constraints, i.e.,
constraints of the form s ≤ t where s and t are terms. Depending on the application,
the ordering ≤ may have different interpretations. A concept may unify by
subsumption with another concept if one of the concepts is a more general expression
of the other, as defined in the partial order. There are also constraint approaches in
logic programming where constraints are not interpreted over a single structure. An
example for such an approach is H. Aït-Kaci’s Login [2], where first-order terms are
replaced by feature terms.

The formal definition of unification for Conceptual Graphs is set out in [7, 8, 10];
however, it is essential to clarify the difference between the “join” operator and the
general concept of unification. The difference between these two operators can be
illustrated in the following way. In the standard canonical formation rules for
Conceptual Graphs, unbound concepts are existentially quantified. We take for our
example the two graphs in Figure 2, which can be interpreted as “Felix is on some
object,” and “There is some animal sitting on that particular mat.” Joining these two
graphs is not possible under the standard canonical formation rule for external join
because there is no projection from one graph to the other. There are, however,
individual concepts which can be joined, such as the concept that “Felix is a cat” and
“animal.” As discussed in previous sections of this paper, true unification
is the knowledge conjunction of the two graphs. The unification of these two
Conceptual Graphs would be similar to the unification of ψ-terms presented by
Aït-Kaci [1]. The unification is therefore “Felix sat on mat number 47,” as shown in
Figure 3. Here, the more general concepts of “animal,” “on,” and “object” have been
replaced by their more specific instances. This illustrates that unification is more than
an external join, and is composed of several operations, including join.

Figure 2. Is Felix on the mat?

Unification, however, is somewhat more complicated, and also more interesting 
and useful than merely an extension of the join operation. The unification of two 
graphs contains neither more nor less information than the two graphs being unified. 
Figure 3 shows that the unification of the two graphs in Figure 2 still retains all the 
information of the original two graphs. This is the idea behind knowledge 
conjunction [10]. 
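The Felix example can be sketched as code. The Python below is a deliberately simplified illustration and not the unification algorithm of [7, 8, 10]: each graph is reduced to a single (concept, relation, concept) triple, the two graphs are assumed to share the same relation type, and the tree-shaped `PARENT` hierarchy and all function names are hypothetical.

```python
# Hypothetical tree-shaped type hierarchy: type -> direct supertype.
PARENT = {"cat": "animal", "mat": "object", "animal": "T", "object": "T", "T": None}

def is_subtype(t, s):
    """True if t equals s or lies below s in the hierarchy."""
    while t is not None:
        if t == s:
            return True
        t = PARENT[t]
    return False

def unify_concepts(c1, c2):
    """Most general concept specializing both (type, referent) pairs, or None."""
    (t1, r1), (t2, r2) = c1, c2
    if is_subtype(t1, t2):
        t = t1
    elif is_subtype(t2, t1):
        t = t2
    else:
        return None        # incomparable types have no consistent common subtype here
    if r1 == "*":
        r = r2             # a generic concept takes the other referent
    elif r2 == "*" or r1 == r2:
        r = r1
    else:
        return None        # two different individuals cannot be identified
    return (t, r)

def unify_graphs(g1, g2):
    """Unify two single-triple graphs concept by concept, or return None."""
    (a1, rel1, b1), (a2, rel2, b2) = g1, g2
    if rel1 != rel2:
        return None
    a, b = unify_concepts(a1, a2), unify_concepts(b1, b2)
    return None if a is None or b is None else (a, rel1, b)

felix_on_object = (("cat", "Felix"), "on", ("object", "*"))
animal_on_mat = (("animal", "*"), "on", ("mat", "#47"))
print(unify_graphs(felix_on_object, animal_on_mat))
# (('cat', 'Felix'), 'on', ('mat', '#47'))
```

The result keeps every piece of information from both inputs and adds none, which is the sense of knowledge conjunction used above.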

4. The Architectural Design Tool 

There have recently been many research forays by the design community into 
computational design tools which will give the designer useful structures which can 
be combined and constrained in useful ways [3, 5, 12]. There have also been attempts 
in the CG community to assist in defining methods and techniques which will be 
useful in computational design [8, 9, 14, 25] . 

The results discussed in this section are those recorded from the application of the 
Conceptual Graphs reasoning tool operating over the domain of architectural design. 
The domain knowledge is represented as Conceptual Graphs with constraints. Here, 
we demonstrate the idea behind the reasoning mechanism by employing order sorted 
unification and constraints within the domain of architectural design. 

The concepts discussed previously were implemented in Allegro Common Lisp on
a Sun Workstation. All of the relations were implemented as Lisp functions, and all
data structures were Lisp lists. Many different types of designs were detailed, in order
to cover a wide range of generic design problems, all type hierarchies and
subsumption problems, and various relations. These designs were unified in various
combinations, in order to test the functionality of unification with and without
constraints of various kinds, and to demonstrate the usefulness of the reasoning
technique. The combinations represented the types of design problems that can be
encountered in the real world.

Figure 3. Felix is on the mat.

The point of automated search for the designer is to use computer media that 
engage designers in exploring design modifications. The design user may want to 
create new designs, or index, compare or adapt existing designs. This type of user 
requires efficient representations for the designs and states (of designs) in a symbol 
system [25] . The designer needs to be able to represent spaces of possibilities which 
are both relevant to design exploration and lend themselves to tractable computations. 
It is necessary for the design process that the information in the system can be ordered 
by specificity, since design exploration usually means starting from an under- 
specified design and proceeding to a more specialized state. 

Consider a design for the kitchen of a custom-made house. In this design, the 
architect has specified some of the lighting design and that the floor area must be 
greater than 20 square meters. The architect has also retrieved an old design, which
specifies the remainder of the lighting design. The graphs specifying these two
designs are shown in Figure 4. We assume that the portions of the graphs not shown in
the diagram are compatible. The unification software discussed above combines
these two graphs, with the result shown in Figure 5. In this graph, all the original 
knowledge of the first two graphs has been preserved, and the values in the concepts 
have been joined as specified. 

Another example would have a design similar to the second in Figure 4, 
specifying most of the lighting design. Another graph would represent a kitchen 
design where only the plumbing design is specified. These two would unify since the 
two heads are compatible, and the remainder of the graphs would be included in the 
unified graph. All of the knowledge is represented in the unified graph, which would 
specify the design for the lighting and the plumbing. 

These examples also illustrate how the interval type would allow real numbers to 
be represented in CGs. Any real number could be bounded inside an interval, similar 
to the concept of using floating point numbers to represent real numbers in software. 
Further, any concept containing a real value can be constrained in an interval. This 
allows the representation in CGs of constraint satisfaction problems. This use of 
interval constraints to represent real constraints has been used for some time in the 
Constraint Satisfaction Problem community. The work by van Hentenryck [22, 23] is 
a good example of intervals in CSP. 
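Unifying two interval-constrained values amounts to intersecting the intervals. The following Python sketch illustrates this under stated assumptions: the `Interval` class is hypothetical, bounds are treated as closed for simplicity (so "greater than 20" is approximated by a lower bound of 20), and the old design's range of 15 to 30 square meters is invented for the example.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    low: float
    high: float

    def intersect(self, other):
        """Return the interval satisfying both constraints, or None if none exists."""
        low, high = max(self.low, other.low), min(self.high, other.high)
        return Interval(low, high) if low <= high else None

# Constraint from the new design: floor area of at least 20 square meters.
required_area = Interval(20.0, math.inf)
# Hypothetical constraint carried by the retrieved old design.
old_design_area = Interval(15.0, 30.0)

print(required_area.intersect(old_design_area))      # Interval(low=20.0, high=30.0)
print(required_area.intersect(Interval(5.0, 10.0)))  # None: the designs cannot be unified
```

An empty intersection corresponds to a failed unification, so the constraint check needs no separate mechanism.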

5. Results and Discussion: Architectural Design Tool 

These graphs can be used to efficiently represent a building design ontology. The use
of Conceptual Graphs is an efficient method for representing not only the designs, but
also constraints on the designs and knowledge conjunction of designs. The system
described in this paper allows general designs to be represented as concepts, and also
allows values to be constrained by specifying real-valued constraints as intervals.

Figure 5. The unified design.



The three main areas where the architects 
want the contribution of Conceptual Graph 
unification are in type subsumption, knowledge- 
level reasoning, and pattern matching. First, 
architects want to be able to use type 
subsumption to make statements such as, “An 
office (or kitchen, or corridor) is a kind of room. 
All the properties which apply to one should 
apply to its specializations.” This is distinct from 
the object-oriented objective of objects inheriting 
all the properties of a class of objects. The 
essential difference is in treating a kitchen as you 
would any generic room. A generic room can be 
placed, occupy space, and have attributes like 
color and number of doors. A class of rooms will 
have attributes, but cannot be said to occupy a 
space or have specific dimensions, or have a 
specific count or placement of doors. 

Conceptual Graphs and the Unify software 
that we developed give this ability to the 
architects. The Unify algorithm allows the user 
to specialize designs by matching (unifying) 
previous designs with the current design problem. 
Since all characteristics, attributes and constraints 
are carried along in the unification, the 
specialization represents all of the design 
concepts included in the more generic design. 
Further, and more importantly, there is no real 
separation between generic and specific, since all 
points in between can be represented. Conceptual 
Graphs combined with the ability to specialize 
using unification are the ideal tool for the 
knowledge combination approach and the 
constructive nature of architectural design. 

The second major concern of architectural
designers was the ability to have knowledge-level 
reasoning. That is, they want to be able to speak 
in the language of the architect, not the language 
of the computer (or CAD system). The user 
wants to be able to refer to the “North Wall” or 
“door” without resorting to discussing geometric 
coordinates in space. The user wants to depart 
from previous CAD-based data-level processing,
and work at the knowledge level in the architecture domain.

This is certainly another area where Conceptual Graphs and unification combine 
to bring a solution to this domain. While spatial coordinates (and their constraints) 
can be stored in a graphical representation of a room, there is no need for the user to 
bother with using them. The graph can be manipulated as a whole, and treated as a 
room, rather than a square in a diagram. The completed system will not deal with 
lines and boxes, but rather with specializing entire designs for rooms (or houses, or 
office buildings). This approach frees the architect from dealing with data-level 
concerns of numbers and coordinates, and allows the architect instead to deal with the 
architectural design. 

Finally, the users want to be able to start with a high-level, generic description of a 
building, and then make queries such as, “Can this bay structure be used in the 
support structure?” or, “Do the constraints match up adequately for a particular 
technology to be used? If yes, tell me the constraints under which it is usable.” 

Once again, the work presented in this paper meets the requirements of the 
architects. A query is represented as a Conceptual Graph. The user can specify a 
type of structure for support, and make the query by attempting to unify the structure 
with the more generic design. If the unification fails, then the user knows that the 
proposed structure does not meet the constraints of the design problem. If the graphs
unify, then the resulting graph will contain the constraints which must be met in order 
to make the design work. 
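
As a rough illustration of the query mechanism just described, the sketch below unifies only the numeric constraints attached to two designs. It is a simplification under stated assumptions: the real system unifies whole Conceptual Graphs, whereas here each design is reduced to a dictionary of closed intervals, and the names `unify_constraints`, `span_m` and `load_kpa` as well as all numbers are hypothetical.

```python
from typing import Dict, Optional, Tuple

Interval = Tuple[float, float]  # closed interval (low, high)

def unify_constraints(generic: Dict[str, Interval],
                      proposed: Dict[str, Interval]) -> Optional[Dict[str, Interval]]:
    """Intersect interval constraints attribute by attribute.

    Returns the combined constraints, or None when some shared attribute
    has an empty intersection (the unification fails).
    """
    result = dict(generic)
    for name, (lo, hi) in proposed.items():
        if name in result:
            g_lo, g_hi = result[name]
            lo, hi = max(lo, g_lo), min(hi, g_hi)
            if lo > hi:
                return None  # conflicting constraints: the query answer is "no"
        result[name] = (lo, hi)
    return result

# Hypothetical data: a generic design and a proposed bay structure.
generic_design = {"span_m": (4.0, 9.0), "load_kpa": (2.0, 5.0)}
bay_structure = {"span_m": (6.0, 12.0)}
print(unify_constraints(generic_design, bay_structure))
```

When the call returns a dictionary, its intervals play the role of "the constraints under which it is usable"; a `None` result corresponds to a failed unification.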

Overall, the system of unification over constraints on Conceptual Graphs 
presented in this paper gives a set of tools to the designer. The ability to use 
knowledge combination with constraints to handle objects at the knowledge level 
greatly leverages the ability of the designer to work efficiently. 

6. The Air Operations Officer 

As our second knowledge domain, we discuss the use of unification and constraints 
for applying rules in a defense domain. An Air Operations Officer (usually known as 
an OPSO) is the defense officer responsible for deciding the appropriate defensive 
response to an air threat. A study of the Operations Officer decision-making methods 
was recently conducted, using a cognitive modeling technique [15, 16]. The study 
was used to show the usefulness of cognitive modeling in deriving rules from expert 
knowledge. In this section, we only make use of the rules which resulted from the 
study; the cognitive modeling technique is not discussed here. 

In the domain of the Operations Officer, the magnitude of the response to an air 
threat is in proportion to the threat itself. So, if the opposing aircraft are very close, or 
if the aircraft is of a type which can cause a great deal of damage (known as a strike 
aircraft), then the response is large. If the threat is smaller, then the response is 
smaller. For example, Figure 6 shows a rule in this domain. (We have borrowed the
style of Cao [4] to express the rule, although we do not employ Cao’s fuzzy reasoning
here.) This graph expresses the rule that if a fighter aircraft (small threat) is between
400 and 500 nautical miles distant, then assert a threat level of “alert 60” (the lowest
level of alert, in which response fighters must be ready to take off within sixty
minutes), and a single fighter is assigned to deal with this threat.

The assertion shown in Figure 6 unifies with the “if” portion of this rule. The
“then” portion represents the response to the situation, and it is asserted into the 
current world knowledge. In this manner, we can represent the decision-making 
capabilities of the Operations Officer. 

The rule shown in Figure 7 is used for a bigger and more impending threat. Any 
threat aircraft which is closer than 400 nautical miles is considered an immediate 
threat, and a response squadron must be ready very quickly. Further, a strike aircraft 
is one which can inflict a great deal of damage, and is therefore dealt with more 
severely than a fighter aircraft. 

The assertion shown in Figure 7 states that a bomber is known to be between 380 
and 390 nautical miles distant. Our type hierarchy indicates that a bomber is a type of 
strike aircraft. Because of the proximity of the threat, the response aircraft are put on 
“alert 10” status. Because of the enormity of the threat, two fighters are assigned to 
deal with the target aircraft. Again, the assertion unifies with the “if” portion of the
rule, causing the “then” portion of the rule to be asserted. 
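
The rule application described in this section can be sketched as follows. This is only an approximation under assumptions: the type hierarchy fragment, the rule table and the function names are hypothetical, the distance ranges are transcribed from the prose rather than from the figures, and unification is reduced to type subsumption plus interval overlap.

```python
from typing import Dict, Optional, Tuple

Interval = Tuple[float, float]

# Assumed fragment of the type hierarchy: child -> parent.
PARENT = {"bomber": "strike_aircraft",
          "strike_aircraft": "aircraft",
          "fighter": "aircraft"}

def is_subtype(t: Optional[str], ancestor: str) -> bool:
    """Walk up the hierarchy until the ancestor is found or the root is passed."""
    while t is not None:
        if t == ancestor:
            return True
        t = PARENT.get(t)
    return False

def overlap(a: Interval, b: Interval) -> Optional[Interval]:
    """Intersection of two closed intervals, or None if they are disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

# (aircraft type in the "if" part, distance range in nautical miles, "then" part)
RULES = [
    ("fighter", (400.0, 500.0), {"alert": 60, "fighters": 1}),
    ("strike_aircraft", (0.0, 400.0), {"alert": 10, "fighters": 2}),
]

def respond(aircraft: str, distance: Interval) -> Optional[Dict[str, int]]:
    """Return the "then" part of the first rule whose "if" part unifies."""
    for rule_type, rule_range, response in RULES:
        if is_subtype(aircraft, rule_type) and overlap(distance, rule_range) is not None:
            return response
    return None

print(respond("bomber", (380.0, 390.0)))
```

The bomber assertion matches the strike-aircraft rule only because the hierarchy records that a bomber is a kind of strike aircraft; without that link the rule would not fire.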



[Figures 6 and 7: rule and assertion graphs, not recoverable from the source. Surviving assertion labels: fighter: *; [495, 510]: *.]





7. Results and Discussion: The Air Operations Officer 

Conceptual Graphs and the unification algorithm can be used to efficiently represent a 
set of rules in the domain of the Air Operations Officer. The use of Conceptual 
Graphs is an efficient method for representing the complete ontology of the OPSO, 
not only in the rules, but also in the exploration and use of the knowledge of types of 
aircraft and responses. General rules can be represented as Conceptual Graphs, and 
then specialized dynamically to match the current situation and describe an 
appropriate response. 

8. Conclusions 

We have demonstrated a method for automated reasoning on ontologies, using 
Conceptual Graphs to represent the underlying ontology. Type hierarchies and the 
canonical formation rules efficiently specialize graphs into concrete instances. A 
simple unification operation, using join and type subsumption, is used to validate real 
constraints over an entire unified graph. The significance of our work is that the 
previously static knowledge representation of ontology is now a dynamic, functional 
reasoning system.

Selection of Tasks and Delegation of Responsibility
in a Multiagent System
for Emergent Process Management

Abstract. Emergent processes are high-level business processes; they
are opportunistic in nature whereas production workflows are routine. 
Emergent processes may not be managed; they may contain goal-driven 
sub-processes that can be managed. A multiagent system supports 
emergent processes. Each player is assisted by an agent. The system 
manages goal-driven sub-processes and manages the commitments that 
players make to each other during emergent sub-processes. These 
commitments will be to perform some task and to assume some level of 
responsibility. The way in which the selection of tasks and the 
delegation of responsibility is done attempts to reflect high-level 
corporate principles and to ‘sit comfortably’ with the humans involved. 
Commitments are derived through a process of inter-agent negotiation 
that considers each individual’s constraints and performance statistics. 

The system has been trialed on business process management in a 
university administrative context. 

1. Introduction 

Emergent processes are business processes that are not predefined and are ad hoc. These 
processes typically take place at the higher levels of organisations [1], and are distinct 
from production workflows [2]. Emergent processes are opportunistic in nature whereas 
production workflows are routine. How an emergent process will terminate may not be 
known until the process is well advanced. Further, the tasks involved in an emergent 
process are typically not predefined and emerge as the process develops. Those tasks 
may be carried out by collaborative groups as well as by individuals [3]. The support or 
management of emergent processes should be done in a way that reflects corporate 
priorities and that ‘sits comfortably’ with the humans involved. 

From a process management perspective, emergent processes may contain 
“knowledge-driven” sub-processes as well as conventional “goal-driven” sub-processes. 
A knowledge-driven process is guided by its ‘process knowledge’ and ‘performance 
knowledge’. The goal of a knowledge-driven process may not be fixed and may mutate. 
On the other hand, the management of a goal-driven process is guided by its goal which 
is fixed. A multiagent system to manage the “goal-driven” processes is described in [4]. 
In that system each human user is assisted by an agent which is based on a generic 
three-layer, BDI hybrid agent architecture. The term individual refers to a user/agent 
pair. That system is extended here to support knowledge-driven processes and so to 
support emergent process management. The general business of managing knowledge-
driven processes is illustrated in Fig. 1, and will be discussed in Sec. 2. The following
sections are principally a description of how the system in [4] has been extended to
support the management of knowledge-driven processes. Sec. 3 discusses the
management of the process knowledge. Sec. 4 describes the performance knowledge
which is communicated between agents in contract net bids for work. Sec. 5 compares
various strategies for evaluating these bids.

[Fig. 1 residue. Surviving labels: “Process Knowledge (knowledge of what has been achieved so far; how much it has/should cost etc)”, “Revise”, “Process Goal (what we presently think we are trying to achieve over all)”.]

Process management is an established application area for multi-agent systems [5]. 
One valuable feature of process management as an application area is that ‘real’ 
experiments may be performed with the cooperation of local administrators. The system 
described here has been trialed on emergent process management applications within 
university administration. 

2. Process management 

Following [2] a business process is “a set of one or more linked procedures or activities 
which collectively realise a business objective or policy goal, normally within the 
context of an organisational structure defining functional roles (also [6]) and 
relationships”. Three classes of business process are defined in terms of their 
management properties [7] (ie in terms of how they may be managed). 

• A task-driven process has a unique decomposition into a (possibly conditional)
sequence of activities. Each of these activities has a goal and is associated with a
task that “always” achieves this goal. Production workflows are typically task-
driven processes.

• A goal-driven process has a process goal, and achievement of that goal is the
termination condition for the process. The process goal may have various
decompositions into sequences of sub-goals where these sub-goals are associated
with (atomic) activities and so with tasks. Some of these sequences of tasks may
work better than others, and there may be no way of knowing which is which [8].
A task for an activity may fail outright, or may be otherwise ineffective at achieving
its goal. In other words, failure is a feature of goal-driven processes. If a task fails
then another way to achieve the process goal may be sought.

• A knowledge-driven process may have a process goal, but the goal may be vague and
may mutate [9]. Mutations are determined by the process patron, often in the light
of knowledge generated during the process. At each stage in the performance of a
knowledge-driven process the “next goal” is chosen by the process patron; this
choice is made using general knowledge about the context of the process, called
the process knowledge. The process patron also chooses the tasks to achieve that
next goal; this choice may be made using general knowledge about the effectiveness
of tasks, called the performance knowledge. So in so far as the process goal gives
direction to goal-driven (and task-driven) processes, the process knowledge gives
direction to knowledge-driven processes. The management of knowledge-driven
processes is considerably more complex than that of the other two classes of process. But
knowledge-driven processes are “not all bad”: they typically have goal-driven sub-
processes which may be handled in a conventional way. A simplified view of
knowledge-driven process management is shown in Fig. 1.

Properties of the three classes of process are shown in Fig. 2.

Fig 2. Properties of the three types of process



Task-driven processes may be managed by a simple reactive agent architecture based 
on event-condition-action rules. Goal-driven processes may be modelled as state and 
activity charts [10] and managed by plans that can accommodate failure [11]. Such a
planning system may provide the deliberative reasoning mechanism in a BDI agent
architecture and is used in a goal-driven process management system [4] where tasks are
represented as plans for goal-driven processes. But the success of execution of a plan for
a goal-driven process is not necessarily related to the achievement of its goal. One
reason for this is that an instance may make progress outside the process management 
system — two players could go for lunch for example. So each plan for a goal-driven 
process should terminate with a check of whether its goal has been achieved. Managing 
knowledge-driven processes is rather more difficult, see Fig. 1. The complete 
representation, never mind the maintenance, of the process knowledge would be an 
enormous job. 
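
The point that a plan for a goal-driven process should end with an explicit goal check, rather than trusting the plan's own success or failure, can be sketched as below. This is a minimal illustration, not the planning system of [4]; `run_goal_driven` and its arguments are hypothetical names.

```python
from typing import Callable, Iterable

def run_goal_driven(plans: Iterable[Callable[[], None]],
                    goal_achieved: Callable[[], bool]) -> bool:
    """Try alternative plans in turn and test the goal itself after each one."""
    for plan in plans:
        try:
            plan()
        except Exception:
            # A failed task is a normal feature of a goal-driven process.
            # The goal may still have been achieved outside the system,
            # so the check below decides, not the plan's outcome.
            pass
        if goal_achieved():
            return True
    return False  # no plan led to the goal: another way must be sought
```

Because the goal test is independent of how the plan ended, progress made outside the process management system (the lunch example above) is picked up at the next check.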

3. Process knowledge and the goals 

This section refers to the left-hand side of Fig. 1, and to the relationship between the 
process knowledge, the process goal and the next-goal. This is the intractable part of 
knowledge-driven process management. 

The process knowledge in any real application includes a significant amount of 
general and common sense knowledge. The system does assist in the maintenance of 
the process knowledge by ensuring that any virtual documents generated during an 
activity in a knowledge-driven sub-process are passed to the process patron when the 
activity is complete. Virtual documents are either interactive web documents or 
workspaces in the LiveNet workspace system [6] which is used to handle virtual 
meetings and discussions. 

The system records, but does not attempt to understand, the process goal. Any
possible revisions of the process goal are carried out by the patron without assistance from
the system. Likewise, the decomposition of the process goal to decide “what to do
next” (the next-goal) is left to the patron. It may appear that the system does not do very much at all! If
the next-goal is the goal of a goal-driven process (which it may well be) then the
system may be left to manage it as long as it has plans in its plan library to achieve that
next-goal. If the system does not have plans to achieve such a goal then the user may
be able to quickly assemble such a plan from existing components in the plan library.
The organisation of the plan library is a free-form, hierarchic filing system designed
completely by each user. Such a plan only specifies what has to be done at the host
agent. If a plan sends something to another agent with a sub-goal attached it is up to 
that other agent to design a plan to deal with that sub-goal. If the next-goal is the goal 
of a knowledge-driven process then the procedure illustrated in Fig. 1 commences at the 
level of that goal. 

So for this part of the procedure, the agent provides assistance with updating the 
process knowledge, and if a next-goal is the goal of a goal-driven sub-process then the 
system will manage that sub-process, perhaps after being given a plan to do so. 

4. Performance knowledge 

This section refers to the right-hand side of Fig. 1, that is, to the representation and
maintenance of the performance knowledge. The performance knowledge is used to
support task selection (ie who does what) through inter-agent negotiation. Its role is
comparative: to decide which choice is better than another. It is not intended to have
absolute currency. With this use in mind, the performance knowledge comprises
performance statistics on the operation of the system. These performance statistics are
proffered by an agent in bids for work.

The system achieves its goal through the way in which the agents interact. Five 
groups of inter-agent communication types may be received by a particular agent A. 

• Sub-goal distribution group. A command and an invitation for agent A to submit
bids to assume responsibility for a sub-goal for a particular process instance. A bid
and a declination to take responsibility for a sub-goal. A commitment that
responsibility has been taken for a sub-goal.

• Activity completion group. A declaration that an activity initially intended to 
achieve a next-goal has been, or has yet to be, completed, and that certain associated 
process knowledge was derived during that activity. The assent that a declaration 
has been accepted by another agent. The refusal to accept a declaration by another 
agent for some reason. 

• Declarative group. A request to another agent for a fact. An assertion by another 
agent of a fact. The acknowledgment that a communicated fact is satisfactory or is 
unsatisfactory for some reason. 

• Authority group. The delegation and retraction of authority to bid for certain types 
of sub-goal. 

• Priority group. An instruction (ie a command) and an appeal (ie a request) to 
modify the agent’s priorities between sub-goals. The agreement and the refusal for 
some reason to comply. 

The basis of the agent interaction is negotiation. Negotiation is achieved through 
the bidding mechanism of contract nets with focussed addressing [12]. A bid consists 
of the five pairs of real numbers (Constraint, Allocate, Success, Cost, Time). The pair 
Constraint is an estimate of the earliest time that the individual (ie. agent/human pair) 
could address the task — ie ignoring other non-urgent things to be done, and an estimate 
of the time that the individual would normally address the task if it “took its place in 
the in-tray”. The agent receiving a bid attaches a subjective estimate of ‘value’ to that 
bid. The pair Allocate is an estimate of the mean density of work flowing into the 
agent (ie. delegated to the agent) and an estimate of the mean density of work flowing 
out of the agent (ie. delegated by the agent). A success parameter is the likelihood that
an agent will complete work within the constraints prescribed. A time parameter is the
total elapsed time that the agent takes to complete work. A cost parameter is the cost,
usually measured in time expended on the job, that the agent incurs to complete the
work. The three parameters success, time and cost are assumed to be normally
distributed (success is binomially distributed but is approximately normal under the
standard conditions). The pairs Success, Time and Cost are estimates of the means and
standard deviations of these three parameters.
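
One possible in-memory shape for such a bid is sketched below, together with the usual normal approximation for a binomially distributed success rate. The class and function names are assumptions made for illustration; the paper does not prescribe a data structure, and the standard deviation used is the textbook formula for a sample proportion, not necessarily the estimator used in the system.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Tuple

Pair = Tuple[float, float]

@dataclass(frozen=True)
class Bid:
    constraint: Pair  # (earliest time the task could be addressed, normal "in-tray" time)
    allocate: Pair    # (mean density of work flowing in, mean density of work flowing out)
    success: Pair     # (mean, standard deviation) of the likelihood of completing on time
    cost: Pair        # (mean, standard deviation) of cost
    time: Pair        # (mean, standard deviation) of total elapsed time

def success_estimate(completed: int, attempts: int) -> Pair:
    """Mean and standard deviation of a success proportion.

    Uses sqrt(p * (1 - p) / n), the standard deviation of a binomial
    proportion, as the spread of the approximating normal distribution.
    """
    if attempts <= 0:
        raise ValueError("at least one attempt is needed")
    p = completed / attempts
    return p, sqrt(p * (1.0 - p) / attempts)

# Hypothetical history: 8 of 10 past tasks completed within their constraints.
bid = Bid(constraint=(1.0, 4.0), allocate=(2.5, 1.0),
          success=success_estimate(8, 10), cost=(3.0, 0.5), time=(6.0, 1.5))
print(bid.success)
```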

The estimates of the means and standard deviations of the three parameters success, 
time and cost are made on the basis of historical information and are useful as long as 
performance is statistically stable. If performance is unstable then an agent may have
some idea of the reason why; such reasons may result from communication with other
agents. These reasons may be used to revise the “historically based” estimates to give
an informed estimate of performance that takes into account the reasons why things
behaved the way they did [7]. The performance estimates are used for two distinct
purposes. First, they are used by each agent’s deliberative reasoning mechanism to 
decide which plan to use for what. Second, they are used by the bidding mechanism by 
which agents interact and so take responsibility for sub-processes. 

Unfortunately, the important value parameter (ie. the value added to a process) is 
often very difficult to measure [12]. Some progressive organisations employ 
experienced staff specifically to assess the value of the work of others. The existing 
system does not attempt to measure value; each individual simply represents the 
perceived subjective value of each other individual’s work as a constant for that 
individual. These constant subjective estimates are attached to each incoming bid. 

The deliberative reasoning mechanism of the three-layer BDI agents is based on the
non-deterministic procedure: “on the basis of current beliefs, identify the current
options; on the basis of current options and existing commitments, select the current
commitments (or goals); for each newly-committed goal choose a plan for that goal;
from the selected plans choose a consistent set of things to do next (called the agent’s
intentions)”. To apply this procedure requires a mechanism for identifying options, for
selecting goals, for choosing plans and for scheduling intentions [9]. The problem of
selecting a plan from a set of plans is equivalent to choosing a path through a ‘large’
composite plan that contains disjunctive nodes; this problem is expressed in terms of
choosing such a path here. A plan or path may perform well or badly. A path’s
performance is defined in terms of: the likelihood that the path will succeed, the
expected cost or time to execute the path, the expected value added by the path, or some
combination of these measures. If each agent knows how well the choices that it has 
made have performed in the past then it can be expected to make decisions reasonably 
well as long as path performance remains reasonably stable. One mechanism for 
achieving this form of adaptivity is reinforcement learning [13]. An alternative approach 
based on performance estimates is described in [4]. 

Inferred explanations of why an observation is outside expected limits may 
sometimes be extracted from observing the interactions with the users and other agents 
involved. Inferred knowledge such as this gives one possible cause for the observed 
behaviour; so such knowledge enables us to refine, but not to replace, the historical 
estimates of parameters. 

5. Task Selection 

This section concerns the selection of a task for a given next-goal as shown in the
middle of Fig. 1. The selection of a plan to achieve a next-goal typically involves
deciding what to do and selecting who to ask to assist in doing it. The selection of
what to do and who to do it can not be subdivided because one person may be good at
one form of task and bad at others. So the “what” and the “who” are considered
together. The system provides assistance in making this decision. Sec. 5 describes
how performance knowledge is attached to each plan and sub-plan. For plans that
involve one individual only this is done for instantiated plans. That is, there are
estimates for each individual and plan pair. In this way the system offers advice on
choosing between individual A doing X and individual B doing Y. For plans that
involve more than one individual this is done for abstract, uninstantiated plans only.
This is something of a compromise but avoids the system attempting to do the
impossible, for example, maintaining estimates on performance of every possible
composition of committee. This does not weaken the system if a plan to form a
committee is embedded in a plan that gives an individual the responsibility for forming
that committee, because estimates are gathered for the performance of the second of
these.

There are two basic modes in which the selection of “who” to ask is done. First the 
authoritarian mode in which an individual is told to do something. Second the 
negotiation mode in which individuals are asked to express an interest in doing 
something. This second mode is implemented using contract nets with focussed 
addressing [14] with inter-agent communication being performed in KQML [15]. When
contract net bids are received the successful bidder has to be identified. So no matter
which mode is used, a decision has to be made as to who to select. The use of a multi-
agent system to manage processes expands the range of feasible strategies for delegation
from the authoritarian strategies described above to strategies based on negotiation
between individuals. Negotiation-based strategies that involve negotiation for each
process instance are not feasible in manual systems for everyday tasks due to the cost of
negotiation. If the agents in an agent-based system are responsible for this negotiation
then the cost of negotiation may be negligible. A mechanism is described here to
automate this negotiation.

If the agent making a bid to perform a task has a plan for achieving that task then 
the user may permit the agent to construct the bid automatically. As the bids consist of 
six meaningful quantities, the user may opt to construct the bid manually. A bid 
consists of the five pairs of real numbers (Constraint, Allocate, Success, Cost, Time). 
The pair constraint is an estimate of the earliest time that the individual could address 
the task — ie ignoring other non-urgent things to be done, and an estimate of the time 
that the individual would normally address the task if it “took its place in the in-tray”. 
The pairs Allocate, Success, Cost and Time are estimates of the mean and standard 
deviation of the corresponding parameters as described above. The receiving agent then:

• attaches a subjective view of the value of the bidding individual;

• assesses the extent to which a bid should be downgraded (or not considered at all)
because it violates process constraints, and

• selects an acceptable bid, if any, possibly by applying its ‘delegation strategy’.

If there are no acceptable bids then the receiving agent “thinks again”. 
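
The three steps carried out by the receiving agent might look as follows in outline. Everything specific here is an assumption made for illustration: the dictionary keys, the single `deadline` constraint, the `late_penalty` downgrade and the final "take the highest score" choice stand in for whatever process constraints and delegation strategy the user has actually configured.

```python
from typing import Dict, List, Optional

def select_bid(bids: List[Dict], value_of: Dict[str, float],
               deadline: float, late_penalty: float = 0.5) -> Optional[str]:
    """Return the chosen bidder, or None when no bid is acceptable."""
    scored = []
    for bid in bids:
        # Step 1: attach the receiver's subjective value for this bidder.
        score = value_of.get(bid["bidder"], 0.0)
        earliest_finish = bid["earliest_start"] + bid["mean_time"]
        normal_finish = bid["normal_start"] + bid["mean_time"]
        # Step 2: discard or downgrade bids that violate the constraint.
        if earliest_finish > deadline:
            continue                   # cannot meet the deadline at all
        if normal_finish > deadline:
            score *= late_penalty      # feasible only if the task jumps the in-tray
        scored.append((score, bid["bidder"]))
    if not scored:
        return None                    # the receiving agent must "think again"
    # Step 3: simplest possible choice; a delegation strategy could go here.
    return max(scored)[1]

bids = [
    {"bidder": "A", "earliest_start": 0.0, "normal_start": 5.0, "mean_time": 3.0},
    {"bidder": "B", "earliest_start": 1.0, "normal_start": 2.0, "mean_time": 3.0},
]
print(select_bid(bids, {"A": 0.9, "B": 0.6}, deadline=6.0))
```

In this example A is valued more highly but could only finish in time by ignoring other work, so its score is halved and B is selected.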

6. The delegation strategy 

A delegation strategy is a strategy for deciding who to give responsibility to for doing 
what. A user specifies the delegation strategy that is used by the user’s agent to 
evaluate bids. In doing this the user has considerable flexibility first in defining payoff 
and second in specifying the strategy itself. Practical strategies in manual systems can 
be quite elementary; delegation is a job which some humans are not very good at. A 
delegation strategy may attempt to balance some of the three conflicting principles: 
maximising payoff, maximising opportunities for poor performers to improve and 
balancing workload. Payoff is defined by the user and could be some combination of
the expected value added to the process, the expected time and/or cost to deal with the
process, and the expected likelihood of the process leading to a satisfactory conclusion
[16].

The system provides assistance to the user by suggesting how delegation could be 
performed using a method that the user has specified in terms of the tools described 
below. The user can opt to let the system delegate automatically, or can opt to delegate 
manually. 

Given a sub-process, suppose that we have some expectation of the payoff $D_i$ as a
result of choosing the $i$’th individual (ie agent and user pair) from the set of candidates
$\{X_1,\dots,X_i,\dots,X_n\}$ to take responsibility for it. A delegation strategy at time $\tau$ is
specified as $S = \{P_1,\dots,P_i,\dots,P_n\}$ where $P_i$ is the probability of delegating
responsibility at time $\tau$ for a given task to individual $X_i$ chosen from
$\{X_1,\dots,X_i,\dots,X_n\}$. The system suggests an individual/task pair stochastically using the
delegation strategy.

[Fig. 3 caption fragment: “... for a learning rate = 0.1, death factor = 0.05, and $\alpha$ = 0.6.”]
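
Since a delegation strategy is just a probability distribution over the candidates, the stochastic suggestion can be sketched with a weighted draw. The function name `suggest` and the candidate labels are hypothetical; the draw uses Python's standard `random.Random.choices`.

```python
import random
from typing import Sequence

def suggest(candidates: Sequence[str], strategy: Sequence[float],
            rng: random.Random) -> str:
    """Draw one candidate, the i'th with the i'th probability of the strategy."""
    if len(candidates) != len(strategy):
        raise ValueError("one probability is needed per candidate")
    if abs(sum(strategy) - 1.0) > 1e-9:
        raise ValueError("a delegation strategy must sum to 1")
    return rng.choices(candidates, weights=strategy, k=1)[0]

rng = random.Random(0)  # fixed seed so that repeated runs agree
picks = [suggest(["X1", "X2", "X3"], [0.7, 0.2, 0.1], rng) for _ in range(1000)]
print({name: picks.count(name) for name in ("X1", "X2", "X3")})
```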

Corporate culture may determine the delegation strategy. Four delegation strategies
are described. If corporate culture is to choose the individual whose expected payoff is
maximal then the delegation strategy best is:

$$P_i = \begin{cases} \dfrac{1}{m} & \text{if } X_i \text{ is such that } \Pr(X_i \gg) \text{ is maximal} \\ 0 & \text{otherwise} \end{cases}$$

where $\Pr(X_i \gg)$ means “the probability that $X_i$ will have the highest payoff” and $m$ is
such that there are $m$ individuals for whom $\Pr(X_i \gg)$ is maximal. In the absence of any
other complications, the strategy best attempts to maximise expected payoff. Using this
strategy, an individual who performs poorly may never get work. Another strategy prob
also favours high payoff but gives all individuals a chance, sooner or later, and is
defined by $P_i = \Pr(X_i \gg)$. The strategies best and prob have the feature of
‘rewarding’ quality work (ie. high payoff) with more work. If corporate culture dictates
that individuals should be treated equally but at random then the delegation strategy
random is $P_i = \frac{1}{n}$. If the corporate culture dictates that each task should be allocated
to $n$ individuals in strict rotation then the delegation strategy circulate is:



$$P_i = \begin{cases} 1 & \text{if this is the } k\text{’th trial and } k \equiv i \pmod{n} \\ 0 & \text{otherwise} \end{cases}$$

The strategies random and circulate attempt to balance workload and ignore expected 
payoff. The strategy circulate only has meaning in a fixed population, and so has 
limited use. 
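
The four strategies can be written down compactly once the probability that each individual will have the highest payoff is available. The sketch below estimates that probability by Monte Carlo sampling, assuming independent normally distributed payoffs described by (mean, standard deviation) pairs; this estimator, the function names, and the reading of circulate as strict rotation (trial k goes to individual i when k is congruent to i modulo n) are assumptions made for illustration. The paper's strategy random is called `uniform` here to avoid clashing with Python's `random` module.

```python
import random
from typing import List, Sequence, Tuple

def prob_highest(estimates: Sequence[Tuple[float, float]],
                 draws: int = 20000, seed: int = 0) -> List[float]:
    """Monte Carlo estimate of the chance that each individual has the top payoff."""
    rng = random.Random(seed)
    wins = [0] * len(estimates)
    for _ in range(draws):
        samples = [rng.gauss(mu, sigma) for mu, sigma in estimates]
        wins[samples.index(max(samples))] += 1
    return [w / draws for w in wins]

def best(pr: Sequence[float]) -> List[float]:
    top = max(pr)
    m = sum(1 for p in pr if p == top)  # number of individuals tied for the maximum
    return [1.0 / m if p == top else 0.0 for p in pr]

def prob(pr: Sequence[float]) -> List[float]:
    return list(pr)

def uniform(pr: Sequence[float]) -> List[float]:
    n = len(pr)
    return [1.0 / n] * n

def circulate(n: int, trial: int) -> List[float]:
    # Individuals are numbered 1..n, trials 1, 2, 3, ...
    return [1.0 if trial % n == i % n else 0.0 for i in range(1, n + 1)]

pr = prob_highest([(0.8, 0.1), (0.5, 0.1), (0.2, 0.1)])  # hypothetical estimates
print(best(pr), uniform(pr), circulate(3, 4))
```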

A practical strategy that attempts to balance maximising “expected payoff for the 
next delegation” with “improving available skills in the long term” could be constructed 
if there was a model for the expected improvement in skills — ie a model for the rate at 
which individuals learn. This is not considered here. 

An admissible delegation strategy has the properties: 

• if $\Pr(X_i \gg) > \Pr(X_j \gg)$ then $P_i > P_j$

• if $\Pr(X_i \gg) = \Pr(X_j \gg)$ then $P_i = P_j$

• $P_i > 0$ ($\forall i$)

So the three strategies best, random and circulate are not admissible. An admissible 
strategy will delegate more responsibility to individuals with a high probability of 
having the highest payoff than to individuals with a low probability. Also with an 
admissible strategy each individual considered has some chance of being given 
responsibility. The strategy prob is admissible and is used in the system described in 
[4]. It provides a balance between favouring individuals who perform well with giving 
occasional opportunities to poor performers to improve their performance. The strategy 
prob is not based on any model of process improvement and so it can not be claimed to 
be optimal in that sense. The user selects a strategy from the infinite variety of
admissible strategies: $S = \delta \times best + \epsilon \times prob + \phi \times random + \gamma \times circulate$ will be
admissible if $\delta, \epsilon, \phi, \gamma \in [0,1]$, $\delta + \epsilon + \phi + \gamma = 1$ and if $\epsilon > 0$. This leads to the
question of how to select a strategy. As circulate is only meaningful in stable
populations it is not considered here.
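
A mixed strategy and the three admissibility properties can be checked mechanically, as in the sketch below. It leaves out circulate (weight zero), takes the probabilities of having the highest payoff as given, and uses hypothetical function names; it is a check of the stated definitions, not code from the system.

```python
from typing import List, Sequence

def mixed_strategy(pr: Sequence[float], delta: float,
                   epsilon: float, phi: float) -> List[float]:
    """Weighted mix of best (delta), prob (epsilon) and random (phi)."""
    if min(delta, epsilon, phi) < 0 or abs(delta + epsilon + phi - 1.0) > 1e-9:
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    n, top = len(pr), max(pr)
    m = sum(1 for p in pr if p == top)
    return [delta * (1.0 / m if p == top else 0.0) + epsilon * p + phi / n
            for p in pr]

def is_admissible(pr: Sequence[float], strategy: Sequence[float]) -> bool:
    """Test the three admissibility properties for one set of probabilities."""
    for i in range(len(pr)):
        if strategy[i] <= 0:
            return False                      # everyone must have some chance
        for j in range(len(pr)):
            if pr[i] > pr[j] and not strategy[i] > strategy[j]:
                return False                  # strict ordering must be preserved
            if pr[i] == pr[j] and abs(strategy[i] - strategy[j]) > 1e-12:
                return False                  # equals must be treated equally
    return True

pr = [0.6, 0.3, 0.1]  # hypothetical probabilities of having the highest payoff
print(is_admissible(pr, mixed_strategy(pr, 0.5, 0.5, 0.0)))
print(is_admissible(pr, mixed_strategy(pr, 1.0, 0.0, 0.0)))
```

With these numbers the half-best, half-prob mix passes, while pure best fails because two individuals receive probability zero, and pure random fails because it ignores the ordering.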

A world is designed in which the relative performance of the four strategies best,
prob, random and circulate is simulated. There are always three individuals in this
world. If individuals die (ie they become unavailable) then they are replaced with new
individuals. At each cycle (ie a discrete time unit) one delegation is made. There is a
natural death rate of 5% for each individual for each cycle. The payoff of each
individual commences at 0 and improves by 10% of “what there is still to learn” on
each occasion that an individual is delegated responsibility. So an individual’s recorded
payoff is progressively: 0, 0.1, 0.19, 0.271, 0.3439, and so on, tending to 1.0 in the
long term. The mean and standard deviation estimates of expected payoff are calculated
as described above in Sec. 4 using a value of $\alpha$ = 0.6. In addition the individuals have a
strength of belief of the extent to which they are being given more work than the other two
individuals in the experiment. This strength of belief is multiplied by a “rebel” factor
and is added to the base death rate of 5%. So if work is repeatedly delegated to one
individual then the probability of that individual dying increases up to a limit of the
rebel factor plus 5%. A triple duplication occurs when work is delegated to the same
individual three cycles running. The proportion of triple duplications is used as a 
measure of the lack of perceived recent equity in the allocation of responsibility. The 
payoff and proportion of triple duplications for the four strategies are shown against the 
rebel factor on the top and bottom graphs respectively in Fig. 3. The simulation run for 
each value is 2 000 cycles. The lack of smoothness of the graphs is partially due to the 
pseudo-random number generator used. When the rebel factor is 0.15 — ie three times 
the natural death rate — all four strategies deliver approximately the same payoff. The 
two graphs indicate that the prob strategy does a reasonable job at maximising payoff 
while keeping triple duplications reasonably low for a rebel factor of < 0.15. However, 
prob may only be used when the chosen definition of payoff is normally distributed. 
The strategy best also assumes normality; its definition may be changed to “such that 
the expected payoff is greatest” when payoff is not normal. 
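
The payoff sequence and the death-rate adjustment described above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, and the strength of belief is assumed here to be a number between 0 and 1, which the text does not state explicitly.

```python
def payoff_after(n_delegations, rate=0.1):
    """Payoff after n delegations: each one adds 10% of what is still to learn."""
    payoff = 0.0
    for _ in range(n_delegations):
        payoff += rate * (1.0 - payoff)
    return payoff


def death_probability(belief, rebel_factor, base_rate=0.05):
    """Base death rate plus the rebel factor scaled by the strength of belief.

    Assumption: belief lies in [0, 1], so the result is at most
    rebel_factor + base_rate, the limit quoted in the text.
    """
    return base_rate + rebel_factor * belief


if __name__ == "__main__":
    print([round(payoff_after(n), 4) for n in range(5)])
    print(death_probability(1.0, 0.15))
```

The loop is equivalent to the closed form 1 - 0.9^n, which makes the convergence to 1.0 explicit.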



Fig. 4. Setting up a task in the system



7. Conclusion 



High-level business processes are analysed as being of three distinct types [17]. The 
management of knowledge-driven processes has been described. An existing multi- 
agent system for goal-driven process management [4] has been extended to support the 
management of knowledge-driven processes. The conceptual agent architecture is a 
three-layer BDI, hybrid architecture [18]. During a process instance the responsibility 
for sub-processes may be delegated. The system forms a view on who should be asked 
to do what at each step in a process. Each user defines payoff in some acceptable way. 
Payoff may be defined in terms of estimates of various parameters. These estimates are 
based on historic information; they are revised if they are not statistically stable. Using 
three basic built-in strategies, the user then specifies a delegation strategy for the chosen 
definition of payoff. In this way the system may be permitted to handle sub-process
delegation automatically. The system has been trialed on an application in a university
administrative context. Three delegation strategies, [δ = 0.5, ε = 0.5, φ = 0], prob and
[δ = 0, ε = 0.5, φ = 0.5], represent varying degrees of the “aggressive pursuit of payoff”
and have been declared “reasonable” in very limited trials.
Abstract. A well-known circumscription policy in situation calculus
theories of actions is to minimize the Abnormality predicate by varying 
the Holds predicate. Unfortunately this admitted counter-intuitive mod- 
els. A different policy of varying the Result function eliminated these 
models. Explanations of how it did this are not entirely satisfactory, but 
seem to appeal to informal notions of state minimization. We re-examine 
this policy and show that there are simple justifications for it that are 
based on classical automata theory. It incidentally turns out that the 
description “state minimization” for the varying Result policy is more 
accurate than the original nomenclature had intended. 



1 Introduction 

Logical approaches to reasoning about action in a discrete time setting have 
mostly adopted the situation calculus (SC) as the representation language. It 
was invented by McCarthy and Hayes [McCarthy and Hayes 69] as a formal lan- 
guage to capture discrete dynamics. The advantage of the SC is that it is based 
on a multi-sorted first-order language, a syntax with an established and unam- 
biguous pedigree. Correspondingly, its semantics is conventional and its inference 
mechanism is a suitably augmented first-order logic. The aim of this paper is 
to explain, using routine ideas from classical automata theory, why a variant 
nonmonotonic minimization policy in reasoning about actions is successful. As
a bonus, not only are useful connections with automata theory exposed, but
the prospect of “compiling” this minimization policy into standard automata
synthesis algorithms is also enhanced. We assume some familiarity with the SC and
circumscription, a contemporary and detailed account of which is the monograph 
by Shanahan [Shanahan 1997]. However, casual reviews of the relevant concepts 
will be made as we proceed. The automata theory assumed here is treated in 
standard texts such as [Arbib 1969a] and [Booth 68]. Again, we will informally 
review the required concepts as needed. 



2 Situation Calculus and Models 

The SC is a multi-sorted first-order theory in which the sorts are: Sit (situations),
Act (actions), Flu (fluents). There is a binary function $Result : Act \times Sit \to Sit$,
and a binary predicate $Holds : Flu \times Sit$. Sit has a distinguished member $s_0$,
called the initial situation. (In some formulations, Result is written as Do.)
The intuition is that Sit is constructed from $s_0$ as results of actions from Act,
denoting the result of action a on situation s by $Result(a, s)$. Thus the ground
terms of Sit form the Herbrand universe of this theory in which the constants are $s_0$
and the action names, and the constructing function is Result. The fluents are
intuitively the potentially observable properties of the situations, as it may be the
case that a particular theory is not stringent enough to determine completely
the status of all fluents in all situations. Fluent observation is via the Holds
predicate, e.g., $\neg Holds(f, Result(a_2, Result(a_1, s_0)))$ says that the fluent f
does not hold after the actions $a_1$ and $a_2$ are applied (in that order) to the
initial situation. A model of a situation calculus theory is a Herbrand model.
The unique names assumption (different constants denote different objects) for
action and fluent sorts follows from the structure of the Herbrand universe, and
likewise the observation that distinct situation terms denote distinct objects.

Consider the set A of formulas:

$$Holds(f, s_0) \wedge Holds(g, s_0) \tag{1}$$

$$\neg Holds(f, Result(a, s)) \leftarrow Holds(f, s) \wedge Holds(g, s) \tag{2}$$

$$Holds(g, Result(b, s)) \leftarrow Holds(f, s) \wedge Holds(g, s) \tag{3}$$

This is a simple example of an action specification using the SC. Here $Flu = \{f, g\}$
and $Act = \{a, b\}$. It is convenient for diagrammatic representation to
adopt the notational convention that $Holds(\neg f, s)$ means $\neg Holds(f, s)$. Formula
1 is an observation sentence, while formulas 2 and 3 are effect axioms.

Figure 1 is a diagram that displays a fragment of $Th(A)$, the deductive
closure of A. This diagram sets down the minimal consequences of A, so that a
fluent (output) F (respectively $\neg F$) appears in a situation (state) S if and only
if $A \models Holds(F, S)$ (respectively $A \models \neg Holds(F, S)$).





Fig. 1. Minimal information situation tree of Th(A) 
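
As a rough illustration of how the minimal information tree arises, the following Python sketch forward-chains formulas (1)-(3) over all action sequences up to a chosen depth. It assumes that, for this particular theory, simple forward chaining yields exactly the entailed literals; the names `ACTIONS` and `forced_literals` are hypothetical.

```python
from itertools import product

ACTIONS = ("a", "b")


def forced_literals(max_depth):
    """Map each action sequence (a tuple, () standing for s0) to the
    fluent values forced by formulas (1)-(3); unknown fluents are absent."""
    forced = {(): {"f": True, "g": True}}  # formula (1)
    for depth in range(max_depth):
        for seq in product(ACTIONS, repeat=depth):
            known = forced.get(seq, {})
            both = known.get("f") is True and known.get("g") is True
            # Formula (2): a makes f false when f and g both hold.
            forced[seq + ("a",)] = {"f": False} if both else {}
            # Formula (3): b makes g true when f and g both hold.
            forced[seq + ("b",)] = {"g": True} if both else {}
    return forced


if __name__ == "__main__":
    for seq, lits in sorted(forced_literals(2).items(), key=lambda kv: len(kv[0])):
        print(seq, lits)
```

Situations with an empty dictionary correspond to circles in the tree whose fluent status is left open by A.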

This tree is incomplete (even ignoring the fact that it represents only an
initial fragment of action terms) in the sense that the status of fluents in some
situations is unknown. These situations can be completed in any way that is
consistent with A, and Figures 2 and 3 are two possibilities. The former is an
inertial completion; no fluent changes from one situation to the next unless
it is forced to do so by an effect axiom or other logical axiom (e.g. domain
constraints or causal rules that are not discussed here). The latter completion is
more arbitrary, and is not only non-inertial (fluent f in $s_0$ changes after action b
when it is not required to do so) but also non-Markovian in that two situations
$s_1$ and $s_2$ that are identical in terms of fluent status nevertheless react differently
to the same action b. In general, for any given SC action specification A, the
class $Mod(A)$ of its models (trees) comprises all the possible completions of the
incomplete fragments as indicated by the examples.




Fig. 2. One completion of the minimal information tree. 




Fig. 3. Another completion of the minimal information tree. 



In order to keep the exposition simple, we will impose a restriction on the
theory A, that the actions are essentially deterministic. In terms of the situation
trees this means that for each action, say a, there is at most one edge leading
from a situation s to a successor situation $Result(a, s)$ labelled by a. We say
“at most” because it may be the case that the action a cannot be performed
in situation s, or in the parlance of action theories, s does not satisfy the
preconditions of a. The restriction is inessential, but will help focus the discussion.

3 Automata 

A (Moore) automaton M is a quintuple $(I, Q, Y, \delta, \lambda)$ where I is the set of
inputs, Y is the set of outputs, Q is the state set, $\delta : Q \times I \to Q$ is the state
transition function, and $\lambda : Q \to Y$ is the output function. The state set can be
infinite. See [Booth 68] or [Arbib 1969a] for details. Diagrammatically, automata
are traditionally represented by using labelled circles for states and labelled arcs
for transitions. The labels in circles are the outputs for the states, and the arcs
are labelled with names of the inputs. Therefore, it is no surprise that situation
calculus models displayed in Figures 2 and 3 are also naturally interpreted as
automata. To be more formal about this interpretation, we need the counterpart
of the procedure outlined above for “completing” a situation. A fluent literal L is
either a fluent f or its negation $\neg f$. For any fluent f the opposite of f is $\neg f$, and
the opposite of $\neg f$ is f. A set X of fluent literals is consistent if it does not have
opposite literals; it is complete if for every fluent f, either $f \in X$ or $\neg f \in X$. X
is maximal if it is both consistent and complete. Now given a SC action theory
A, the translation from its models (situation trees) to automata is a natural one:
Q is Sit, I is Act, Y is MaxLit, where MaxLit is the family of maximal sets of
fluent literals. The transition function $\delta$ is just as naturally defined as follows:
if q is the situation s, then $\delta(q, a) = q'$ where q' is the situation $Result(a, s)$.
Likewise, the output function $\lambda$ will say which fluent literals hold in the
situation, viz., $L \in \lambda(s)$ iff $Holds(L, s)$. Informally, all this is saying is that one may
choose to read the diagrams for completed situation trees either as models of an
action theory or as (infinite, loop-free) automata. While it is possible to fully
formalize the translation above as a functor between two categories (SC trees
and automata), we believe that the identifications indicated are so natural that
further formalization would be pedantic.
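
To make the identification concrete, here is a small Python sketch of a Moore automaton whose states are situations and whose outputs are maximal sets of fluent literals. The three-situation fragment is a hypothetical inertial completion chosen for illustration (the figures are not reproduced here), and negative literals are written with a leading "-" as an encoding assumption.

```python
from dataclasses import dataclass

FLUENTS = ("f", "g")


def opposite(lit):
    """Return the opposite literal: f <-> -f."""
    return lit[1:] if lit.startswith("-") else "-" + lit


def is_maximal(lits):
    """A literal set is maximal if it is both consistent and complete."""
    consistent = all(opposite(l) not in lits for l in lits)
    complete = all(f in lits or "-" + f in lits for f in FLUENTS)
    return consistent and complete


@dataclass
class Moore:
    delta: dict  # (state, input) -> state
    out: dict    # state -> frozenset of fluent literals

    def run(self, state, inputs):
        """Apply the inputs in order and return the output of the final state."""
        for a in inputs:
            state = self.delta[(state, a)]
        return self.out[state]


# Hypothetical fragment: s0 with its two successor situations.
fragment = Moore(
    delta={("s0", "a"): "Result(a,s0)", ("s0", "b"): "Result(b,s0)"},
    out={
        "s0": frozenset({"f", "g"}),
        "Result(a,s0)": frozenset({"-f", "g"}),
        "Result(b,s0)": frozenset({"f", "g"}),
    },
)

if __name__ == "__main__":
    print(sorted(fragment.run("s0", ["a"])))
```

Reading `run` as "which literals hold after this action sequence" is the input-output view that the later sections rely on.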



4 Circumscription 

Having set up the apparatus for describing models of the SC and their alterna- 
tive identification as automata, we are now ready to see how the latter can be 
used to simplify understanding of a circumscription policy used to reason about 
actions. We begin with a review of a circumstance which justified this policy. We 
assume acquaintance with circumscription for minimizing extensions of predicates
and refer the reader to [Lifschitz 1994] and [Shanahan 1997] for contemporary 
summaries and details. McCarthy’s paper [McCarthy 86] that introduced the 
fundamentals of circumscription has invaluable insights. 

Historically, the idea behind a nonmonotonic rule for reasoning about actions
was to capture inertia as we defined it above, viz., if there is no reason for a
fluent literal to change after an action, then it should not. One way to do this is
to introduce a formula

$$Ab(a, f, s) \leftarrow \neg[Holds(f, Result(a, s)) \leftrightarrow Holds(f, s)] \tag{4}$$

in which the predicate $Ab(a, f, s)$ essentially says that fluent f is abnormal
in situation s under the action a because it changes. The idea is that
most actions have only local effects, e.g., painting a shelf should not affect
things other than the color of the shelf; so most fluents will be normal. Now,
if among all models of A we were to choose those which minimized the
extension $[\![Ab]\!]$ of Ab relative to set-containment, while letting the Holds predicate
vary, then in such a model M it will be the case that
$M \models Ab(a, f, s) \leftrightarrow \neg[Holds(f, Result(a, s)) \leftrightarrow Holds(f, s)]$.
It is plausible that this M would be
inertial because of the following reasoning. Suppose that M is not inertial for
a triple (a, f, s), say $M \models Holds(f, s)$, $A \not\models \neg Holds(f, Result(a, s))$ but
$M \models \neg Holds(f, Result(a, s))$. Observe that in this case we have $M \models Ab(a, f, s)$, so
ostensibly there may be an alternative model M' which agrees with M
everywhere except that $M' \not\models Ab(a, f, s)$. However, the Yale Shooting Problem (YSP)
[Hanks and McDermott 1987] showed that this reasoning can be misleading. The
SC theory of the YSP is:



$$Holds(L, Result(Load, s)) \tag{5}$$

$$\neg Holds(A, Result(Shoot, s)) \leftarrow Holds(L, s) \tag{6}$$

$$Holds(A, s_0) \tag{7}$$

$$\neg Holds(L, s_0) \tag{8}$$



The YSP proved to be a fatal counter-example to the hope that by merely
minimizing $[\![Ab]\!]$ it is possible to enforce global inertia. This was demonstrated
by exhibiting two models I and II (see Figure 4) for the YSP which were
incomparable for their respective extensions $[\![Ab]\!]_{I}$ and $[\![Ab]\!]_{II}$, and yet both
were minimal for these extensions. The situation tree fragments in the figure are
laid out horizontally rather than vertically, and only some of the relevant actions
are shown from each situation.

The problem is that in minimizing Ab across a trajectory (a sequence of 
actions) rather than a single action, it may be possible to trade off a later 
abnormality for an earlier one, or vice-versa. 

Now, to recall one well-known way to overcome this, we review the vocabulary
of circumscription. Instead of saying that we minimize the predicate Ab, we can
conform to the practice in nonmonotonic logic by saying that we circumscribe
it. The notation $Circ(A(P); P; Q)$ denotes a second-order formula which means
the circumscription of the predicate P in action theory $A(P)$ in which P appears
as a predicate constant, letting the predicate (or function) Q vary; for details
see [Lifschitz 1994] and [Shanahan 1997]. So, by adding $Circ(A(Ab); Ab; Holds)$
to the YSP theory augmented with the abnormality formula 4, we find that
there are two models I and II as shown. By “letting Holds vary” is meant the
policy of choosing (as far as possible consistent with the rest of the theory) which
fluent literals should hold in any situation. This is equivalent to completing the
circles in situation trees by filling in literals, or what amounts to the same thing,
choosing certain completed trees. The choice is made to minimize or circumscribe
the extension of Ab.

[Figure 4 labels: Ab(I) = {<load, L, s0>, <shoot, A, s2>}; Ab(II) = {<load, L, s0>, <wait, L, s1>}]

Fig. 4. Partial situation trees for the Yale Shooting Problem
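
The incomparability of the two abnormality extensions listed in Figure 4 can be checked mechanically. The Python sketch below encodes the two sets as given in the figure labels and selects the models whose extension has no strict subset among the candidates; the helper name `minimal_models` is hypothetical.

```python
# Abnormality extensions of models I and II, as labelled in Figure 4.
AB = {
    "I": frozenset({("load", "L", "s0"), ("shoot", "A", "s2")}),
    "II": frozenset({("load", "L", "s0"), ("wait", "L", "s1")}),
}


def minimal_models(extensions):
    """Names of models whose Ab-extension is not a strict superset of another."""
    return {
        name
        for name, ab in extensions.items()
        if not any(other < ab for other in extensions.values())
    }


if __name__ == "__main__":
    print(sorted(minimal_models(AB)))
```

Because neither set contains the other, set-containment minimization alone cannot prefer the intended model I over model II.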

The flurry of research that followed the discovery of the YSP led to many 
alternative policies for circumscription. One line of attack which was remarkably 
successful was due to Baker [Baker 1991], to which we now turn. 

5 Varying Result 

Baker’s idea was to change the circumscription policy to allow the Result
function to vary rather than the Holds predicate. We illustrate the difference
diagrammatically in Figure 5. In this figure, the situation s is a typical one in a
situation tree, or equivalently a state in its corresponding automaton. Varying
Holds is tantamount to filling in consistent fluent literals in the situation (state)
$Result(a, s)$. Varying Result is to “re-target” the situation to which the
function value $Result(a, s)$ should point. The corresponding automaton may then no
longer be loop-free.

In either case, the objective is to circumscribe Ab. The formula that expresses
varying Result is denoted by $Circ(A(Ab); Ab; Result)$. If this is the policy that
is used for the YSP, it is not hard to see that only one model survives, for the
reason suggested by Figure 6. This policy is informally described by its advocates
as State Minimization, presumably because it provided more compact situation
configurations due to the permitted re-targetings. However, as we shall argue,
this nomenclature is in fact more accurately descriptive than was initially
imagined. In Figure 6, the second (unintended) model II of the YSP (we have
added for a more complete picture the shoot action in situation $s_1$) is modified
to that of II' in which the abnormality $Ab(Wait, L, s_1)$ is eliminated by
re-targeting $Result(Wait, s_1)$ back to $s_1$. This strictly reduces the extension of Ab in
model II (nothing else has changed), so model II is not a circumscriptive model of
the YSP (augmented by the abnormality axiom 4) if Result is allowed to vary.
Observe that the corresponding automaton now has a “self-loop” at $s_1$, and
moreover, if $s_0$ is regarded as an initial state, then $s_2$ is no longer reachable.

Fig. 5. Varying Holds vs. Varying Result (varying Result means the freedom to choose where Result(a, s) should point)

Fig. 6. Varying Result in model II of the YSP

What about model I of the YSP? Well, if we apply the standard automaton
reduction procedure (see [Booth 68] [Arbib 1969a]) to both I and II' of Figures 4
and 6 respectively, it can be verified that the resulting automata are isomorphic,
so in fact there is only one circumscribed model of the YSP under this policy.
This reduced automaton is shown in Figure 7.




Ab(I) = {<load, L, s0>, <shoot, A, s2>}

Fig. 7. Reduced Automaton of Models I and II'



It is natural to ask if there is any general characterization of such models from
the automaton-theoretic perspective. The next section addresses this.

6 Abnormality for Automata 

In circumscribing Ab by varying Result we obtain situation tree models that are
also interpretable as automata. There is therefore no difficulty in understanding
what is meant by the extension of Ab for an automaton M. But to say the
almost obvious, for an automaton M, a triple (a, f, q) is in $[\![Ab_M]\!]$ if the fluent f
changes from state q to state $\delta(q, a)$. Let $M_R$ be the reduced automaton of M. We
recall that automaton reduction is essentially a process of identifying states that
cannot be distinguished by any input-output experiment. In our context, inputs
correspond to action sequences and outputs to sets of literals which hold at the
ends of such sequences. Figure 8 is a schematic representation of this observation.
A minor technical issue is how to compare abnormalities in an automaton with
its reduced form, e.g., if we have $(a, f, q_1)$ and $(a, f, q_2)$ abnormal in M, and in
its reduction $q_1$ is identified with $q_2$, what is the status of the abnormality
$(a, f, [q_1, q_2])$ in $M_R$ with respect to the two abnormalities which it inherits? The
most elegant way to deal with this is to refer again to the idea in Figure 8. We
can stipulate that $(a, f, [q_1, q_2]) \in \{(a, f, q_1), (a, f, q_2)\}$ since the input-output
perspective permits one to interpret the former as any member of the latter set.

If $[\![Ab_M]\!]$ is the extension of abnormality in automaton M, and $[\![Ab_{M_R}]\!]$ is
that in its reduced automaton $M_R$, the following is not hard to see.

Proposition 1 $[\![Ab_{M_R}]\!] \subseteq [\![Ab_M]\!]$.
[Figure 8: an input action sequence a1, a2, a3, ..., aN is fed to a blackbox, which answers the query Holds(f, Result(aN, ... Result(a1, s) ...))? Identical answers from two blackboxes mean they are Abnormality-equivalent.]

Fig. 8. The Input-Output View of SC Models



The upshot of Proposition 1 is that nothing is lost in considering reduced 
automata as models of the SC under the circumscription policy of letting Result 
vary. Moreover, it suffices to search for such models of the SC within the class 
of reduced automata since these cannot be distinguished from those models by 
input-output queries. 
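
The reduction step referred to here is the classical partition-refinement procedure for Moore automata. The Python sketch below is a minimal version for finite automata with a total transition function; the three-state example (a load step followed by an unwound wait step) is a hypothetical toy, not one of the paper's figures.

```python
def reduce_moore(states, inputs, delta, out):
    """Return the classes of input-output-indistinguishable states
    of a finite Moore automaton with total transition function delta."""

    def renumber(signature):
        # Replace arbitrary hashable signatures by small integer block ids.
        ids = {}
        return {q: ids.setdefault(sig, len(ids)) for q, sig in signature.items()}

    # Initially, states are grouped by their outputs alone.
    block = renumber({q: out[q] for q in states})
    while True:
        # Refine: a state's signature is its block plus its successors' blocks.
        refined = renumber(
            {q: (block[q], tuple(block[delta[(q, a)]] for a in inputs)) for q in states}
        )
        if len(set(refined.values())) == len(set(block.values())):
            break  # no block was split, so the partition is stable
        block = refined
    classes = {}
    for q in states:
        classes.setdefault(block[q], set()).add(q)
    return {frozenset(c) for c in classes.values()}


# Toy example: s1 and s2 carry the same literals and behave identically.
STATES = ["s0", "s1", "s2"]
INPUTS = ["load", "wait"]
DELTA = {
    ("s0", "load"): "s1", ("s0", "wait"): "s0",
    ("s1", "load"): "s1", ("s1", "wait"): "s2",
    ("s2", "load"): "s2", ("s2", "wait"): "s2",
}
OUT = {
    "s0": frozenset({"A", "-L"}),
    "s1": frozenset({"A", "L"}),
    "s2": frozenset({"A", "L"}),
}

if __name__ == "__main__":
    print(reduce_moore(STATES, INPUTS, DELTA, OUT))
```

In the toy example the wait transition from s1 to s2 collapses into a self-loop on the merged state, which is the effect the re-targeting of Result achieves.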

Definition 1 Let an automaton M be such that it has an a-labelled transition
from state q to $\delta(q, a)$ and the fluent f has not changed between the two states.
Then we say that the triple (a, f, q) is inertial.

Observe that if such an M is also a model of a SC theory, then (a, f, q) is inertial
if and only if this triple is not in $[\![Ab_M]\!]$.

Definition 2 If (a, f, q) is inertial for every fluent f in an automaton M, then
we say that the a-labelled transition is inertial.

For a reduced automaton, an inertial transition is often (but not always) a 
“self-loop”. It can be shown that this will be the case if the theory A admits 
Markovian models [Foo and Peppas 2001]. 
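
Definitions 1 and 2 translate directly into code. The following Python sketch computes the abnormality extension of a finite automaton and tests whether a transition is inertial; the two-state automaton is a hypothetical YSP-like example, and fluent status is encoded as a dictionary of booleans per state.

```python
def ab_extension(states, inputs, fluents, delta, holds):
    """All triples (a, f, q) such that fluent f changes from q to delta(q, a)."""
    return {
        (a, f, q)
        for q in states
        for a in inputs
        for f in fluents
        if holds[q][f] != holds[delta[(q, a)]][f]
    }


def is_inertial_transition(a, q, fluents, delta, holds):
    """Definition 2: the a-transition at q is inertial if no fluent changes."""
    return all(holds[q][f] == holds[delta[(q, a)]][f] for f in fluents)


# Hypothetical example: loading changes L, waiting changes nothing.
STATES = ["s0", "s1"]
INPUTS = ["load", "wait"]
FLUENTS = ["A", "L"]
DELTA = {
    ("s0", "load"): "s1", ("s0", "wait"): "s0",
    ("s1", "load"): "s1", ("s1", "wait"): "s1",
}
HOLDS = {"s0": {"A": True, "L": False}, "s1": {"A": True, "L": True}}

if __name__ == "__main__":
    print(ab_extension(STATES, INPUTS, FLUENTS, DELTA, HOLDS))
```

In this example every inertial transition is a self-loop, in line with the remark above about reduced automata.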

Proposition 2 Let a reduced automaton M be a $Circ(A(Ab); Ab; Result)$ model
of action theory A. Consider any automaton M' which is obtained from M by
transforming a non-inertial triple in M to an inertial one. Then M' cannot be a
model of $Circ(A(Ab); Ab; Result)$ unless it undergoes a further transformation
in the reverse, i.e., at least one of its inertial triples is made non-inertial.

Corollary 1 No reduced automaton M can be a model of $Circ(A(Ab); Ab; Result)$
if it can be transformed to an automaton M' by changing a non-inertial
triple to an inertial one, with M' also being a model of A.

Corollary 1 guarantees that Result-varying circumscribed models cannot
ignore all opportunities for local inertia. It entails the consequence below.

Corollary 2 Suppose for some situation s and action a it is the case that for
some fluent f, A is consistent with $\neg Ab(a, f, s)$. Then there is a reduced
automaton M which is a model of $Circ(A(Ab); Ab; Result)$ that is inertial for (a, f, q)
where q is the state identified with s.

The stronger form of Proposition 2 has an attractive automaton-theoretic
flavor.

Proposition 3 Let a reduced automaton M be a $Circ(A(Ab); Ab; Result)$ model
of action theory A. Consider any automaton M' which is obtained from M by
transforming a non-inertial transition in M to an inertial one. Then M' cannot
be a model of $Circ(A(Ab); Ab; Result)$ unless it undergoes a further
transformation in which at least one inertial triple is made non-inertial.

Diagrammatically this simply means that one cannot change a non-self-loop
in M to a self-loop and expect it to remain a circumscribed model, unless it also
gives up at least one inertial triple elsewhere. The corollaries above also have
corresponding strong forms below.

Corollary 3 No reduced automaton M can be a model of $Circ(A(Ab); Ab; Result)$
if it can be transformed to an automaton M' by changing a non-inertial
transition to an inertial one, with M' remaining a model of A.

Corollary 4 Suppose for some situation s and action a it is the case that for all
fluents f, A is consistent with $\neg Ab(a, f, s)$. Then there is a reduced automaton M
which is a model of $Circ(A(Ab); Ab; Result)$ with an a-labelled inertial transition
at state q, the state identified with s.

These corollaries provide simple explanations of why the policy of varying
Result is successful in eliminating the undesired model II of the YSP. Corollary
4 says that there must be a model with local inertia at $Result(Load, s_0)$ for the
Wait action, since this inertia is consistent with the YSP axioms. So there is
a reduced automaton which is its model, viz., that in Figure 7. Moreover, it is
not hard to verify that it is the only one with local inertia. The reason is that
model II has a non-inertial triple $(Wait, L, Result(Load, s_0))$ which can be
made inertial and still satisfy the YSP axioms, so by Corollary 1, II cannot be
a model of $Circ(A(Ab); Ab; Result)$.

Another standard scenario is the Stolen Car Problem [Baker 1991]. In it,
there is only one fluent, say f, representing the presence of a car in the initial
situation $s_0$. The only action is Wait which has no formula constraining it, nor
any effects, and is supposed to capture the passing of a day. However, in the
theory one is told that $\neg Holds(f, Result(Wait, Result(Wait, s_0)))$, i.e., the car
is gone at the end of the second day. So when was it stolen? Intuitively, there
should be two models, one in which the fluent f changed at the end of the first
day, and another in which it changed at the end of the second day, as shown in
Figure 9. Incidentally, they are not Markovian. A circumscription policy should
not eliminate either model, nor should it prefer one of them. [Baker 1991] shows
that the policy of letting Result vary achieves this.

For the Stolen Car Problem Corollary 4 guarantees the existence of a model
for each of the two possible days when the car could be stolen. Then Proposition
3 shows that they can only transform to each other.
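
The two intended models can be enumerated by brute force. The Python sketch below is a simplification: it only compares abnormality extensions along the two-step Wait trajectory, assuming the car is present in s0 and gone after the second Wait, and it does not model the re-targeting of Result itself. The function names are hypothetical.

```python
def stolen_car_models():
    """Map the value of f after the first Wait to the resulting Ab-extension."""
    models = {}
    for f_after_day_one in (True, False):
        # Fluent f in s0, s1 = Result(Wait, s0), s2 = Result(Wait, s1).
        values = [True, f_after_day_one, False]
        models[f_after_day_one] = frozenset(
            ("Wait", "f", f"s{i}") for i in range(2) if values[i] != values[i + 1]
        )
    return models


def minimal(models):
    """Keep the models whose Ab-extension has no strict subset among the others."""
    return {
        key: ab
        for key, ab in models.items()
        if not any(other < ab for other in models.values())
    }


if __name__ == "__main__":
    for key, ab in minimal(stolen_car_models()).items():
        print("car present after day one:", key, "Ab =", set(ab))
```

Both candidate models carry exactly one abnormality, on different days, so neither is preferred over the other.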
Two Ab-minimal models by varying Result
Fig. 9. The Two Models of the Stolen Car Problem 



7 Conclusion 

We have shown that classical automata theory can be used to explain a success- 
ful albeit variant policy for circumscription in SC-based theories of action. This 
has several advantages. The first is that it exposes this policy as really a method 
for constructing succinct automaton models of SC theories. It has therefore the 
potential to simplify the logic of circumscription by reducing it to automaton re- 
alization and reduction, for which efficient and transparent algorithms exist. The 
second is that some ostensibly rather puzzling features of this circumscription 
policy are de-mystified as simple properties of reduced automata with locally 
inertial state transitions. 

This paper is an example of how the re-examination of circumscriptive models 
from a systems-theoretic perspective can clarify and highlight connections with 
standard computer science and engineering models. For instance, we alluded 
briefly to the Markov property, and its relationship with non-inertial actions. 
In fact, it can be shown that fully inertial systems are necessarily Markovian. 
In on-going work, we will delineate further similar considerations, among them 
simple explanations for certain delicate axioms that seem to be necessary for 
this circumscription. We also intend to show that it is closely connected with 
algebraic theories, thereby linking their models to algebras that have an ancient
and familiar pedigree.
Abstract. We extend the branching temporal logics CTL and CTL*
with quantified propositions and consider various semantic
interpretations for the quantification. The use of quantification greatly increases
the expressive power of the logics allowing us to represent, for example,
tree-automata. We also show that some interpretations of quantification
allow us to represent non-propositional properties of Kripke frames, such
as the branching degree of trees. However this expressive power may also
make the satisfiability problem for the logic undecidable. We give a proof
of one such case, and also examine decidability in the less expressive
semantics.



1 Introduction 

Temporal logic has been particularly useful in reasoning about properties of 
systems. In particular, the branching temporal logics CTL* [4] and CTL [2] (a 
syntactic restriction of CTL*) have been used to verify the properties of non- 
deterministic and concurrent programs. 

It is our goal to extend these results to a logic that allows propositional
quantification. In the linear case PLTL [16] has been augmented with quantified
propositions to get the significantly more expressive QPTL [18]. In [9] QPTL
was shown to be able to reason about $\omega$-automata, and prove the existence of
refinement mappings [1]. A refinement mapping shows one system specification
$S_1$ implements some other system specification $S_2$ by encoding the specifications
in temporal logic where variables not common to $S_1$ and $S_2$ are quantified out.
Propositional quantification has been shown to be related to logics of knowledge
[8]. Finding decidable semantics for quantified propositional branching time
logics is an important step to finding suitable semantics for the more expressive
epistemic logics.

Previously branching temporal logics with quantified propositions have been
examined in [11], [4], and [13], though only existential quantification was
considered, and the structures were limited to trees. More powerful semantics for
propositional quantification have been studied in intuitionistic propositional logic [19],
[15]. In this paper we give three separate semantic interpretations for full
propositional quantification. We show that one of these, the Kripke semantics, is highly
undecidable. While the other two are decidable, the tree semantics can be used
to reason about purely structural properties of the model (like the branching
degree of trees). This can complicate the notion of refinement, so we introduce
the amorphous semantics to overcome this, and sketch a decision process.



2 The Base Semantics, CTL* 

We first describe the syntax and semantics of CTL*. The language CTL* consists
of an infinite set of atomic variables $V = \{x_0, x_1, \ldots\}$, the boolean operations $\neg, \vee$,
the temporal operators $\bigcirc, \Box, U$ (next, generally and until respectively) and
the path quantifier E. The formulae of CTL* are defined by the following abstract
syntax, where x varies over V:

$$\alpha ::= x \mid \neg\alpha \mid \alpha_1 \vee \alpha_2 \mid \bigcirc\alpha \mid \Box\alpha \mid \alpha_1 U \alpha_2 \mid E\alpha \tag{1}$$

Definition 1. A state formula of CTL* is a formula where every temporal
operator ($\bigcirc$, $\Box$ or $U$) is in the scope of a path quantifier (E).

The abbreviations $\wedge$, $\to$ are defined as usual, and we define $\Diamond\alpha$ (future) to
be $\neg\Box\neg\alpha$, and $A\alpha$ to be $\neg E\neg\alpha$. We also consider the formulas $\top$, $\bot$ (respectively
“true” and “false”) to be abbreviations for, respectively, $x_0 \vee \neg x_0$ and $\neg(x_0 \vee \neg x_0)$.
To give the semantics for CTL* we define V-labeled Kripke frames:



Definition 2. A Kripke frame is a tuple (S, R) where

1. S is a nonempty set of states, or moments.

2. $R \subseteq S^2$ is a total binary relation.

A V-labeled Kripke frame is a Kripke frame with a valuation $\pi : S \to \wp(V)$.



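Let $M = (S, R, \pi)$ be a V-labelled Kripke frame. A path b in M is an $\omega$-sequence
of states $b = (b_0, b_1, \ldots)$ such that for all i, $(b_i, b_{i+1}) \in R$, and we let
$b_{\geq i} = (b_i, b_{i+1}, \ldots)$. We interpret a formula $\alpha$ of CTL* with respect to a V-labelled
Kripke frame M and a path b in M. We write $M, b \models \alpha$ where:

$$M, b \models x \iff x \in \pi(b_0) \tag{2}$$

$$M, b \models \neg\alpha \iff M, b \not\models \alpha \tag{3}$$

$$M, b \models \alpha \vee \beta \iff M, b \models \alpha \text{ or } M, b \models \beta \tag{4}$$

$$M, b \models \bigcirc\alpha \iff M, b_{\geq 1} \models \alpha \tag{5}$$

$$M, b \models \Box\alpha \iff \forall i \geq 0,\ M, b_{\geq i} \models \alpha \tag{6}$$

$$M, b \models \alpha U \beta \iff \exists i \geq 0,\ M, b_{\geq i} \models \beta \text{ and } M, b_{\geq j} \models \alpha \text{ for all } j < i \tag{7}$$

$$M, b \models E\alpha \iff \text{there is some path } b' \text{ s.t. } b'_0 = b_0 \text{ and } M, b' \models \alpha \tag{8}$$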
From the semantics given above we can see that the evaluation of a state
formula only depends on the initial state of the path b, rather than the whole
path. We restrict our attention to state formulas and define a model to be the
tuple $(S, R, \pi, s)$ (or $(M, s)$) where $s \in S$. A state formula $\alpha$ is satisfied by $(M, s)$
(denoted $(M, s) \models \alpha$) if there is some path b in M with $b_0 = s$ and $M, b \models \alpha$. If
for all models $(M, s)$ we have $(M, s) \models \alpha$, then we say $\alpha$ is a validity.

The language CTL is a syntactic restriction of CTL*. In particular, CTL
requires that every temporal operator is paired with a path quantifier. To define
CTL we only need to modify the abstract syntax (1) as follows:

$$\alpha ::= x \mid \neg\alpha \mid \alpha_1 \vee \alpha_2 \mid E\bigcirc\alpha \mid E\Box\alpha \mid A\Box\alpha \mid E(\alpha_1 U \alpha_2) \mid A(\alpha_1 U \alpha_2)$$

The logic CTL is less expressive than CTL*, though it is frequently preferred 
as the model checking complexity is less than that for CTL* . We will show that 
this difference disappears when propositional quantification is considered. 
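
For readers who want to experiment with the CTL fragment, the following Python sketch is a minimal labelling-style model checker over a finite Kripke frame with a total relation. It is an illustration of the standard fixpoint characterisations, not a procedure taken from this paper; formulas are encoded as nested tuples and all names are hypothetical.

```python
def pre(R, target):
    """States with at least one R-successor in target."""
    return {s for (s, t) in R if t in target}


def eu(R, a, b):
    """Least fixpoint for E(a U b)."""
    z = set(b)
    while True:
        new = z | (a & pre(R, z))
        if new == z:
            return z
        z = new


def eg(R, a):
    """Greatest fixpoint for E[]a (R is assumed to be total)."""
    z = set(a)
    while True:
        new = z & pre(R, z)
        if new == z:
            return z
        z = new


def sat(phi, S, R, label):
    """Set of states of the finite frame (S, R, label) satisfying the CTL formula phi."""
    op = phi[0]
    if op == "var":
        return {s for s in S if phi[1] in label[s]}
    if op == "not":
        return S - sat(phi[1], S, R, label)
    if op == "or":
        return sat(phi[1], S, R, label) | sat(phi[2], S, R, label)
    if op == "EX":
        return pre(R, sat(phi[1], S, R, label))
    if op == "EG":
        return eg(R, sat(phi[1], S, R, label))
    if op == "AG":  # A[]a is equivalent to not E(true U not a)
        return S - eu(R, set(S), S - sat(phi[1], S, R, label))
    if op == "EU":
        return eu(R, sat(phi[1], S, R, label), sat(phi[2], S, R, label))
    if op == "AU":  # A(a U b) = not (E(not b U (not a and not b)) or E[] not b)
        a, b = sat(phi[1], S, R, label), sat(phi[2], S, R, label)
        not_b = S - b
        return S - (eu(R, not_b, not_b - a) | eg(R, not_b))
    raise ValueError(f"unknown operator: {op}")


# Two states: p holds only in state 0, and state 1 loops forever.
S = {0, 1}
R = {(0, 1), (1, 1)}
LABEL = {0: {"p"}, 1: set()}
P = ("var", "p")

if __name__ == "__main__":
    print(sat(("EU", P, ("not", P)), S, R, LABEL))
```

Each operator costs at most a number of fixpoint iterations bounded by the number of states, which is the source of the lower model checking complexity mentioned above.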



3 Syntax and Semantics for Propositional Quantification

We add the operator $\exists$ to the languages defined above and define QCTL* to be
the language consisting of the formulae defined by the abstract syntax above,
where the following inductive step is included:

If $\alpha$ is a formula and $x \in V$ then $\exists x\alpha$ is a formula.

The set of formulae of QCTL* is closed under complementation, and we let $\forall x\alpha$
be an abbreviation for $\neg\exists x\neg\alpha$. The logic QCTL is similarly defined to extend
CTL.

The semantic interpretation of propositional quantification will rely on the 
following definition. 

Definition 3. Given some model $(M, s_0) = (S, R, \pi, s_0)$ and some $x \in V$, an
x-variant of $(M, s_0)$ is some model $(M', s_0) = (S, R, \pi', s_0)$ where
$\pi'(s) \setminus \{x\} = \pi(s) \setminus \{x\}$ for all $s \in S$.

To complete the interpretation for formulae of QCTL* we augment the
interpretation above (2-8) with

$$M, b \models \exists x\alpha \iff \text{there is some } x\text{-variant } M' \text{ of } M \text{ such that } M', b \models \alpha. \tag{9}$$

Since we will consider other possible interpretations below we refer to the set 
of semantics defined above as the Kripke semantics. We will show that QCTL* 
becomes highly undecidable over such a general semantic interpretation. There 
are two possible ways to overcome this. 
3.1 Tree Semantics

The undecidability of the Kripke semantics results from the logic being able 
to specify structural properties of the model. By restricting the structures to 
trees, QCTL* becomes decidable. We give the following definitions and note the 
semantics for CTL* (2-8) can be restricted to V-labelled trees without change. 

Given some relation $R \subseteq S \times S$, the transitive closure of $R$ is $<_R \subseteq S \times S$ where
$(s, t) \in <_R$ if and only if for some $n > 0$ there exist $s_0, \ldots, s_n \in S$ with $s_0 = s$,
$s_n = t$ and, for $i < n$, $(s_i, s_{i+1}) \in R$.
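For a finite relation the transitive closure just defined can be computed by repeatedly composing pairs until nothing new appears. The following is a minimal sketch for illustration only; the function name `transitive_closure` and the representation of $R$ as a set of pairs are assumptions made here, not part of the paper.

```python
def transitive_closure(relation):
    """Return all pairs (s, t) joined by a chain of one or more R-steps.

    `relation` is a finite set of (s, t) pairs (an assumed encoding of R).
    """
    closure = set(relation)
    changed = True
    while changed:
        changed = False
        for (s, t) in list(closure):
            for (u, v) in list(closure):
                # Compose (s, t) with (t, v) to obtain (s, v).
                if t == u and (s, v) not in closure:
                    closure.add((s, v))
                    changed = True
    return closure
```

On a chain such as 0, 1, 2, 3 the closure contains no pair (s, s), whereas any cycle in the relation produces such a pair.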

Definition 4. A $V$-labelled tree, $(S, R, \pi, s)$, is a model that satisfies the following
conditions:

1. $S$ is a (countably) infinite set of nodes.

2. $<_R$ is irreflexive.

3. The past of $t \in S$, $\{s \in S \mid s <_R t\}$, is linearly ordered by $<_R$.

4. Each maximally ordered subset of $S$ is order-isomorphic to $\mathbb{N}$.

We refer to this restriction as the tree semantics. Given any model $(M, s_0) =
(S, R, \pi, s_0)$ we can generate a $V$-labelled tree $(M^T, s_0) = (S', R', \pi', s_0)$ (the
unwinding of $(M, s_0)$) where:

- $S' \subseteq S^*$, $s_0 \in S'$, and for any word $w \in S^*$ and any $s \in S$ with $ws \in S'$,
$(s, t) \in R$ if and only if $wst \in S'$.

- $R' = \{(w, ws) \mid w \in S', ws \in S', s \in S\}$.

- $\pi'(s_0) = \pi(s_0)$ and $\pi'(ws) = \pi(s)$ for $s \in S$.

Lemma 1. CTL* is insensitive to unwinding. That is, $(M, s) \models \alpha$ if and only if
$(M^T, s) \models \alpha$.
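The unwinding is infinite, but a finite prefix of it is easy to build and shows how words over $S$ become tree nodes. The sketch below is only an illustration under assumed encodings: `successors` maps each state to a list of successor states, `labels` maps each state to its set of variables, and `depth` is a cut-off introduced here so that the construction terminates.

```python
def unwind(successors, labels, root, depth):
    """Unwind a finite model into its tree of paths, cut off at `depth`.

    Tree nodes are tuples of states (words over S that start at the root);
    a node inherits the label of its last state.
    """
    nodes = [(root,)]
    edges = []
    frontier = [(root,)]
    for _ in range(depth):
        next_frontier = []
        for word in frontier:
            for t in successors[word[-1]]:
                child = word + (t,)
                nodes.append(child)
                edges.append((word, child))
                next_frontier.append(child)
        frontier = next_frontier
    tree_labels = {word: labels[word[-1]] for word in nodes}
    return nodes, edges, tree_labels
```

A single state with a self-loop unwinds into one infinite branch, which is why a quantified proposition can vary along the tree although it is constant on the original one-state model.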

This was proven in [3]. However, QCTL* is not insensitive to unwinding, as was
shown in [11]. For example, consider a model consisting of a single state. For all
possible valuations a proposition would be always true or always false, which
is clearly not the case in tree semantics. While QCTL* becomes decidable in
the tree semantics, it can still define purely structural properties of the model.
Particularly, QCTL (and hence QCTL*) can define the number of successors a
state has. For every $i \in \mathbb{N}$ we can define the state formula $B_i$ such that $M, s \models B_i$
if and only if $s$ has exactly $i$ successors. For example, $B_2 = \exists y\, B_2(y)$ where

$B_2(y) = ((E\bigcirc y \wedge E\bigcirc\neg y) \wedge \forall x(E\bigcirc(y \wedge x) \wedge E\bigcirc(\neg y \wedge x) \rightarrow A\bigcirc x))$. (10)

If there were only one successor of $s$, then $E\bigcirc y$ and $E\bigcirc\neg y$ could not both be
true. If there were more than two successors, we could have $x$ true at exactly
two of the successors, and $y$ true at exactly one of these. Then the left side of
the implication would be satisfied, but $A\bigcirc x$ would not.
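The counting argument can be checked by brute force for small branching degrees. The sketch below assumes that the successors of $s$ are distinct states whose valuations of $x$ and $y$ can be chosen independently, so that only these valuations matter for $B_2$; the function name `b2_holds` is introduced here for illustration.

```python
from itertools import product


def b2_holds(n_successors):
    """Decide B_2 = exists y. B_2(y) at a state with n distinct successors."""
    for y in product([False, True], repeat=n_successors):
        # First conjunct: some successor satisfies y and some satisfies not y.
        if not (any(y) and not all(y)):
            continue
        # Second conjunct: for every valuation of x, if x holds at some
        # y-successor and at some (not y)-successor, x holds at all successors.
        if all(
            not (any(yi and xi for yi, xi in zip(y, x))
                 and any((not yi) and xi for yi, xi in zip(y, x)))
            or all(x)
            for x in product([False, True], repeat=n_successors)
        ):
            return True
    return False
```

Under these assumptions the check succeeds only for branching degree two, in line with the argument above.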
3.2 Amorphous Semantics

The second way to avoid the undecidability of QCTL* is to give a different 
interpretation of propositional quantification. This will remove the ability of
QCTL* to define any of the structural properties of the underlying model. The 
new interpretation requires the following definition: 

Definition 5. Given $X \subseteq V$, the models $(M, s_0) = (S, R, \pi, s_0)$ and $(M', s'_0) =
(S', R', \pi', s'_0)$ are $X$-bisimilar if there exists some relation $B \subseteq S \times S'$ with
$(s_0, s'_0) \in B$ such that for all $(s, s') \in B$:

1. $\pi(s) \setminus X = \pi'(s') \setminus X$.

2. For all $t \in S$ such that $(s, t) \in R$, there exists $t' \in S'$ with $(s', t') \in R'$ such
that $(t, t') \in B$.

3. For all $t' \in S'$ such that $(s', t') \in R'$, there exists $t \in S$ with $(s, t) \in R$ such
that $(t, t') \in B$.

The pairs $(M, b)$ and $(M', b')$ are $X$-bisimilar if there exists such a relation $B$
for $(M, b_0)$ and $(M', b'_0)$ with $(b_i, b'_i) \in B$ for all $i \geq 0$. We write $\{x\}$-bisimilar
as $x$-bisimilar, and $\emptyset$-bisimilar as bisimilar.
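For finite models, $X$-bisimilarity of two rooted models can be decided by starting from all pairs that satisfy condition 1 and deleting pairs that violate condition 2 or 3 until a fixed point is reached. This is a standard greatest-fixed-point sketch and not a procedure taken from the paper; the triple encoding `(successors, labels, root)` and the name `x_bisimilar` are assumptions made for the example.

```python
def x_bisimilar(model_a, model_b, hidden):
    """Decide X-bisimilarity of two finite rooted models.

    Each model is (successors, labels, root): `successors` maps a state to a
    list of states, `labels` maps a state to a set of variables. `hidden` is
    the set X of variables ignored by the comparison.
    """
    succ_a, lab_a, root_a = model_a
    succ_b, lab_b, root_b = model_b
    hidden = set(hidden)
    # Condition 1: labels agree outside X.
    relation = {
        (s, t)
        for s in succ_a
        for t in succ_b
        if set(lab_a[s]) - hidden == set(lab_b[t]) - hidden
    }
    changed = True
    while changed:
        changed = False
        for (s, t) in list(relation):
            # Condition 2: every move of s is matched by some move of t.
            forth = all(any((s2, t2) in relation for t2 in succ_b[t])
                        for s2 in succ_a[s])
            # Condition 3: every move of t is matched by some move of s.
            back = all(any((s2, t2) in relation for s2 in succ_a[s])
                       for t2 in succ_b[t])
            if not (forth and back):
                relation.discard((s, t))
                changed = True
    return (root_a, root_b) in relation
```

A one-state loop and a branching model with identical labels come out as bisimilar, which illustrates that branching degree is invisible up to bisimilarity.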

This definition is based on the notion of the bisimilarity of synchronization 
trees [14] and a similar notion of quantification has been considered in the case 
of intuitionistic propositional logic [19]. The amorphous semantics replace the 
interpretation of quantification (9) with 

$M, b \models \exists x\alpha \iff$ there is $(M', b')$, $x$-bisimilar to $(M, b)$, with $M', b' \models \alpha$. (11)

The amorphous semantics allow us to disregard the purely structural properties
of the model (for example, the formula $B_2$ (10) becomes unsatisfiable). This
is particularly useful for proving the refinement of concurrent specifications, since 
the specifications do not have to be considered over identical structures. We give 
the following lemma without proof, though it is not hard to show. 

Lemma 2. 1. $X$-bisimilarity is an equivalence relation.

2. $(M, s)$ and $(M^T, s)$ are bisimilar.

3. QCTL* is insensitive to unwinding in the amorphous semantics.

4 Definability 

Before addressing the decidability of QCTL* we simplify the syntax we must
use. Particularly we show that in the case of the tree and the amorphous
semantics, QCTL* is definable in a restriction of QCTL that does not include the U
operators. We first claim that the U operator is definable by the other operators
of QCTL*. The construction is taken from [9], and its soundness can easily be
shown.

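Given some pair $(M, b)$ where $b = (b_0, b_1, \ldots)$ is a path in the model $(M, b_0)$,
and formulas $\alpha$ and $\beta$ of QCTL* which do not contain the variable $x_i$, let

$Until(\alpha, \beta) = \exists x_i(\Diamond\beta \wedge (x_i \wedge \Box(x_i \rightarrow (\beta \vee (\alpha \wedge \bigcirc x_i)))))$.



Lemma 3. In the tree and amorphous semantics, $(M, b) \models \alpha U \beta \iff (M, b) \models
\exists x_i\, Until(\alpha, \beta)$.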



We now show how every temporal operator can be paired with a path quan- 
tifier. By the above lemma we do not have to address the U operator. 

Let $E\alpha$ be any formula of QCTL* such that any subformula of $\alpha$ containing a
branch quantifier is a formula of QCTL. For some $y \in V$ that is not a variable of
$\alpha$ we let $\alpha'(y)$ be the formula that results when all subformulas $\Box\beta$ of $\alpha$ which
are not directly preceded by a path quantifier are replaced with $A\Box(y \rightarrow \beta)$,
and likewise for the $\bigcirc$ operator. Let

$\alpha^* = \exists z(z \wedge A\Box(z \rightarrow E\bigcirc z) \wedge \forall y(y \wedge A\Box(y \rightarrow (E\bigcirc y \wedge z)) \rightarrow \alpha'(y)))$. (12)

Lemma 4. $(M, s) \models E\alpha \iff (M, s) \models \alpha^*$, in the amorphous and tree semantics.



Proof. (Tree Semantics) The formula (12) restricts the evaluation of $\alpha'(y)$ to
models where $y$ is true on a set of branches, and the formula $A\Box(y \rightarrow \beta)$
restricts the interpretation only to paths where $y$ is true. If we can restrict $y$ to a
single path the interpretation becomes equivalent to $\Box\beta$. Suppose $(M, s) \models E\alpha$.
We can choose $z$ to be true only on a single path for which $\alpha$ is true. Then $y$ can
only be true along this path and the result follows. Conversely, if $(M, s) \models \alpha^*$
then, since we are considering all possible interpretations of $y$, we must
consider some interpretation which has $y$ true on a single path, and $E\alpha$ must
be true. The $\bigcirc$ operator can be treated in the same way and the result follows.
The proof for the amorphous semantics is similar, though the structure must be
unwound first.



Corollary 1. Given the amorphous semantics or the tree semantics, QCTL* is
definable in QCTL.



Proof. Given some formula $\alpha$ we first remove all occurrences of U by using Lemma
3. The result can then be shown by induction over the complexity of formulas,
working from the inside out and using Lemma 4 to convert each branch
quantified subformula to a formula of QCTL.
5 Decidability

We have defined three possible sets of semantics for QCTL*. The satisfiabil- 
ity problem for QCTL* will be shown to be highly undecidable for the Kripke 
semantics, while it is decidable for the tree and the amorphous semantics. 



5.1 Undecidability 



The consequence of the expressive power of QCTL under the Kripke semantics is that
the satisfiability problem becomes undecidable. In fact Kremer [15] has shown
that intuitionistic propositional logic with quantified propositions over Kripke
structures is recursively isomorphic to full second-order logic. It is believed that
QCTL can be shown to be just as powerful, though here we simply show that
QCTL (and hence QCTL*) is not recursively enumerable. This is done by
encoding the following tiling problem for $(\mathbb{N}, <) \times (\mathbb{N}, <)$ in QCTL.

We are given a finite set $\Gamma = \{\gamma_i \mid i = 1, \ldots, m\}$ of tiles. Each tile has four
coloured sides: left, right, top and bottom, written $\gamma_i^l$, $\gamma_i^r$, $\gamma_i^t$, and $\gamma_i^b$. Each
side can be one of $n$ colours $c_j$ for $j = 1, \ldots, n$. Given any set of these tiles, we
would like to know if we can cover the plane $\mathbb{N} \times \mathbb{N}$ with these tiles such that
adjacent sides share the same colour. Formally, given some finite set of tiles $\Gamma$
we would like to decide if there exists a function $\lambda : \mathbb{N} \times \mathbb{N} \rightarrow \Gamma$ such that for
all $(x, y) \in \mathbb{N} \times \mathbb{N}$

1. $\lambda(x, y)^r = \lambda(x + 1, y)^l$

2. $\lambda(x, y)^t = \lambda(x, y + 1)^b$

where $\lambda(x, y)^t$ is the colour of the top side of the tile on $(x, y)$, and likewise for
the other sides. Finally we require that there is some specific tile $\gamma_j$ that occurs
infinitely often in the bottom row (i.e. $\lambda(x, 0) = \gamma_j$ for infinitely many $x$). In [7]
this problem was shown to be highly undecidable, or $\Sigma_1^1$-complete.
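The two matching conditions are simple to check on any finite patch of the plane, even though the infinite problem is undecidable. The following sketch is only an illustration: tiles are assumed to be dictionaries with keys "left", "right", "top" and "bottom", and `patch_is_tiled` is a name introduced here.

```python
def patch_is_tiled(tiling, width, height):
    """Check conditions 1 and 2 on the patch [0, width) x [0, height).

    `tiling(x, y)` returns the tile placed at (x, y) as a dict of colours.
    """
    for x in range(width):
        for y in range(height):
            here = tiling(x, y)
            # Condition 1: right side matches the left side of the right neighbour.
            if x + 1 < width and here["right"] != tiling(x + 1, y)["left"]:
                return False
            # Condition 2: top side matches the bottom side of the upper neighbour.
            if y + 1 < height and here["top"] != tiling(x, y + 1)["bottom"]:
                return False
    return True
```

A tile whose four sides share one colour tiles every patch, while a tile whose left and right colours differ cannot be placed next to a copy of itself.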

Theorem 1. Given the Kripke semantics, the satisfiability problem for QCTL 
is highly undecidable. 



Given the set of tiles $\Gamma$ we give a formula, $Tile_\Gamma$, of QCTL that is satisfiable
if and only if the above tiling problem has a solution. To specify that some tile $\gamma_j$
occurs infinitely often in the bottom row, we let $\gamma_0$ be a copy of $\gamma_j$, and suppose
that $\gamma_0$ occurs only in the bottom row. This is clearly equivalent to the above
problem.

We start by giving a formula that specifies the underlying Kripke structure
to be a grid (i.e. a structure similar to a binary tree, but with the branches
rejoining). To make a grid we have to specify that the two successors of any
state have a common successor. This is done with the following formulas:

$G(y, z) = (y \wedge E\bigcirc(y \wedge A\bigcirc z)) \rightarrow E\bigcirc(\neg y \wedge E\bigcirc(z \wedge \neg y))$ (13)

$S(y) = A\Box(B_2(y) \wedge \forall z(G(y, z) \wedge G(\neg y, z)))$ (14)

The formula $B_2(y)$ (10) specifies the branching degree to be two, and that $y$
is true for exactly one successor of any state. The formula $S(y)$ uses the universally
quantified variable $z$ to ensure that the two successors of any state share a
successor, since any interpretation that makes $z$ true for all the successors of
the first successor of some point makes $z$ true for at least one successor of the
second successor of that point.

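To encode the tiling we let each tile $\gamma_i$ be represented by the variable $t_i$
and define the formula $T_i = t_i \wedge \bigwedge_{j \neq i} \neg t_j$. Rather than explicitly encoding the
colours we just place restrictions on which tiles can succeed other tiles:

$C_\Gamma(y) = \bigvee_{i=0}^{m} \Big( T_i \wedge y \wedge E\bigcirc\big(y \wedge \bigvee_{j \leq m,\ \gamma_j^l = \gamma_i^r} T_j\big) \wedge E\bigcirc\big(\neg y \wedge \bigvee_{j \leq m,\ \gamma_j^b = \gamma_i^t} T_j\big) \Big)$ (15)

$B_\Gamma(y) = A\Box(\neg y \rightarrow A\Box\neg T_0) \wedge E\Box E\Diamond T_0$ (16)

$A_\Gamma(y) = B_\Gamma(y) \wedge A\Box(C_\Gamma(y) \vee C_\Gamma(\neg y))$. (17)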



The first formula specifies the way tiles can fit together. The variable $y$ is used to
define rows and columns since the value of $y$ is fixed along a row, and alternates
along a column. The second formula specifies that the variable $t_0$ occurs only,
and infinitely often, on the bottom row, since the only path where $y$ is always
true is the bottom row. We define the formula



$Tile_\Gamma = y \wedge S(y) \wedge A_\Gamma(y)$. (18)



Lemma 5. The formula $Tile_\Gamma$ is satisfiable if and only if $\Gamma$ can tile $(\mathbb{N}, <) \times
(\mathbb{N}, <)$, with $\gamma_0$ occurring infinitely often on the bottom row.



Proof. ($\rightarrow$) Suppose that $Tile_\Gamma$ is satisfied by some model $M = (S, R, \pi, s_0)$.
We define the function $\mu : \mathbb{N} \times \mathbb{N} \rightarrow S$ recursively such that $\mu((0, 0)) = s_0$, and
$\mu((a, b)) = t$ where

$\mu(a - 1, b) = s$, $(s, t) \in R$ and $y \in \pi(s) \iff y \in \pi(t)$ (19)

or $\mu(a, b - 1) = s$, $(s, t) \in R$ and $y \in \pi(s) \iff y \notin \pi(t)$. (20)

The function will be surjective and well defined since every $s \in S$ has exactly
two successors, by $B_2(y)$, and $y$ will always be true on exactly one of these
successors. Therefore there is always exactly one successor of $s$ satisfying (19),
and exactly one successor satisfying (20). We can then define the tiling function
as $\lambda((a, b)) = \gamma_i \iff t_i \in \pi(\mu((a, b)))$. The use of $T_i$ in $A_\Gamma(y)$ ensures that every
point $(a, b)$ is assigned exactly one tile, and the formula $B_\Gamma(y)$, along with the
definition of $\mu$, ensures that the tile $\gamma_0$ occurs infinitely often on the bottom row.
All that is left to show is that the sides of the tiles match up. To do this suppose
at some state $\mu(a, b)$, $y$ and $t_i$ are true. Then by the definition of $\mu$, at $\mu(a + 1, b)$ $y$
is true and by $C_\Gamma(y)$ there is some $j$ such that $t_j$ is true where $\gamma_i^r = \gamma_j^l$. Similarly
at $\mu(a, b + 1)$ $y$ is not true and there is some $k$ with $t_k$ true where $\gamma_i^t = \gamma_k^b$. By
the formula $S(y)$, $\mu(a + 1, b)$ and $\mu(a, b + 1)$ have a common successor where $y$ is
not true. By the definition of $\mu$ this state must be $\mu(a + 1, b + 1)$, and suppose $t_l$
is true at this state. The formula $C_\Gamma(y)$ ensures $\gamma_j^t = \gamma_l^b$ and similarly $C_\Gamma(\neg y)$
ensures $\gamma_k^r = \gamma_l^l$. Likewise we can show the case for $\neg y$, so the sides of any four
adjacent tiles match up, and by applying a similar argument recursively we can
see that the generated tiling is sound.

($\leftarrow$) Given that $\lambda$ is a tiling for $\Gamma$ of $(\mathbb{N}, <) \times (\mathbb{N}, <)$ we can construct the
model $M = (S, R, \pi, s_0)$ where $S = \mathbb{N} \times \mathbb{N}$, $R = \{((a, b), (c, d)) \mid (c, d) = (a + 1, b)$ or
$(c, d) = (a, b + 1)\}$, and $s_0 = (0, 0)$. We define $y \in \pi((a, b))$ iff $b$ is even, and $t_i \in \pi((a, b))$ iff
$\lambda(a, b) = \gamma_i$. It is straightforward to show that $(M, s_0) \models Tile_\Gamma$.

This proof demonstrates the extensive expressive power of QCTL*. In fact 
the only formulae that were used were from QCTL, and there was only one 
propositional quantifier used. It is possible to give a similar proof where no path 
quantifiers are used, and hence QPTL [18] is undecidable when repeated states 
are allowed. 



5.2 Decidability 

QCTL* is decidable under the tree semantics. We do not go into details as the
proofs are complicated [5]. The decidability can be shown by expressive equivalence with the
language of tree automata [17], though this approach requires careful treatment
as the branching degree of the model may not be fixed. In [4] the equivalence
between the Rabin tree automata over binary trees and existentially quantified
CTL* is shown. This does not reflect the full expressive ability of QCTL*
over tree semantics since it does not allow for a varying branching degree.

In [6] the decidability of a similar logic over trees was proven. In this proof 
it was shown that the formulas of the language could be transformed so the 
satisfaction of any formula could be decided over binary trees. This approach is 
also applicable in the case of QCTL*. 

To show QCTL* is decidable in the amorphous semantics we extend a method
introduced in [9]. We define a new kind of tree automaton referred to as an
amorphous automaton^1 such that for any formula $\alpha$ of QCTL* an amorphous
automaton $A_\alpha$ can be constructed that will accept exactly the models of $\alpha$. The
decidability of QCTL* then reduces to the emptiness problem for the automaton,
which in turn reduces to the satisfiability problem for CTL*. An amorphous
automaton $A$ is given by the tuple $(\Sigma, Q, q_0, \delta, C)$ where

1. $\Sigma = \wp(var(\alpha))$ is an alphabet, where $var(\alpha)$ is the set of the variables
occurring in $\alpha$.

2. $Q = \{q_0, \ldots, q_n\}$ is a set of automaton states (we refer to states of a model as
moments from now on, to avoid confusion).

3. $q_0 \in Q$ is the initial state.

4. $\delta : Q \times \Sigma \rightarrow \wp(\wp(Q))$ is the transition function.

5. $C = \{(L_i, U_i) \mid L_i, U_i \subseteq Q,\ i \leq k\}$ is the Rabin acceptance condition.

^1 Amorphous tree automata were defined in [12] to act on trees of varying branching
degree. As they are a generalization of the construction presented here we will retain
the name.



We define a run of the automaton $A$ over some model $(M, s_0)$ to be a $Q$-labelled
tree $(T, R^T, \lambda, t_0)$ along with some function $\mu : T \rightarrow M$ such that:

1. $\lambda : T \rightarrow Q$ (i.e. each node is marked with a single automaton state).

2. $\mu(t_0) = s_0$ and $\lambda(t_0) = q_0$.

3. If $\mu(t) = s$ and $(s, s') \in R$ then there is some $t' \in T$ such that $(t, t') \in R^T$
and $\mu(t') = s'$.

4. If $\mu(t) = s$ and $(t, t') \in R^T$ then there is some $s' \in S$ such that $(s, s') \in R$
and $\mu(t') = s'$.

5. There is some set $a \in \delta(\lambda(t), \pi'(\mu(t)))$ such that $a = \{\lambda(t') \mid (t, t') \in R^T\}$,
where $\pi'$ is the projection of $\pi$ onto the variables of $\alpha$ (i.e. $\pi' : S \rightarrow \Sigma$).



Let $r = (T, R^T, \lambda, t_0, \mu)$ be some run of the automaton $A$ over a model $(M, s)$.
We say $r$ is an accepting run if for every path $b$ of $r$ there is some $i \leq k$ such
that some state $\ell \in L_i$ occurs infinitely often along $b$ and every state $u \in U_i$
occurs only finitely often along $b$.
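For a single path the acceptance condition depends only on the set of automaton states that occur infinitely often along it. The sketch below checks the Rabin condition for one such set; it is illustrative only, and the names `rabin_accepts` and `lasso_infinity_set`, as well as the encoding of an ultimately periodic path as a prefix followed by a repeated cycle, are assumptions made here.

```python
def lasso_infinity_set(prefix, cycle):
    """States occurring infinitely often on the path prefix . cycle . cycle ..."""
    return set(cycle)


def rabin_accepts(infinitely_often, pairs):
    """Check the Rabin condition for one path.

    `infinitely_often` is the set of states seen infinitely often; `pairs` is a
    list of (L_i, U_i) pairs of state sets. The path is accepted if for some i
    a state of L_i recurs forever while no state of U_i does.
    """
    inf = set(infinitely_often)
    return any(bool(inf & set(L)) and not (inf & set(U)) for (L, U) in pairs)
```

An accepting run requires this check to succeed on every path of the run, not just on one.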

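For each automaton $A$ we can define a characteristic formula $\chi_A$ in QCTL*,
such that $M, s \models \chi_A$ if and only if $A$ accepts $(M, s)$. To do this we use a set
of variables $P = \{p_0, \ldots, p_n\}$ to represent the automaton states, and define the
formula $at_{q_i} = p_i \wedge \neg\bigvee_{j \neq i} p_j$ for each state $q_i \in Q$ and $in_{Q'} = \bigvee_{q \in Q'} at_q$
for each subset $Q' \subseteq Q$. For each element $\sigma \in \Sigma$ we define the formula
$\hat{\sigma} = \bigwedge\{x \mid x \in \sigma\} \wedge \bigwedge\{\neg x \mid x \in var(\alpha) \setminus \sigma\}$. The formula $\chi_A$ is given by

$run = \bigvee_{q \in Q} \bigvee_{\sigma \in \Sigma} \Big( at_q \wedge \hat{\sigma} \wedge \bigvee_{a \in \delta(q, \sigma)} \big( A\bigcirc\, in_a \wedge \bigwedge_{q' \in a} E\bigcirc\, at_{q'} \big) \Big)$ (21)

$acc = A \bigvee_{i \leq k} (\Box\Diamond\, in_{L_i} \wedge \Diamond\Box\, \neg in_{U_i})$ (22)

$\chi_A = \exists p_0 \ldots \exists p_n (at_{q_0} \wedge A\Box\, run \wedge acc)$. (23)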

Lemma 6. An amorphous automaton $A$ accepts some model $(M, s)$ if and only
if $M, s \models \chi_A$.

This can be seen by comparison of the semantic definitions (2-8), (9) with the
definition of the amorphous automata. Conversely, for every formula $\alpha$ of QCTL*
there is an automaton $A_\alpha$ that accepts exactly the models that satisfy $\alpha$. To
construct the automaton we first convert $\alpha$ into a formula $\alpha'$ of QCTL,
using Lemmas 3 and 4. $A_\alpha$ can then be constructed by induction over the
complexity of $\alpha'$. We give the constructions for all operators except $\neg$. The
complementation construction is double exponential, though it could possibly
be optimized, and it has similarities with Klarlund's construction [10]. We let
$A_\beta = (\Sigma, Q, q_0, \delta, C)$ and $A_\gamma = (\Sigma, Q^*, q_0^*, \delta^*, C^*)$.

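1. Propositions. For $x \in V$ we define $A = (\Sigma, \{q_0, q_1, q_2\}, q_0, \delta, \{(\{q_1\}, \emptyset)\})$,
where for all $\sigma \in \Sigma$, $\delta(q, \sigma) = \{\{q_1\}\}$ if $q = q_0$ and $x \in \sigma$, or $q = q_1$, and
$\delta(q, \sigma) = \{\{q_2\}\}$ otherwise.

2. $\beta \vee \gamma$. We define $A = (\Sigma, Q \cup Q^* \cup \{q'\}, q', \delta', C \cup C^*)$, where for all $\sigma \in \Sigma$, $\delta'(q', \sigma) =
\delta(q_0, \sigma) \cup \delta^*(q_0^*, \sigma)$, for all $q \in Q$, $\delta'(q, \sigma) = \delta(q, \sigma)$, and for all $q \in Q^*$, $\delta'(q, \sigma) =
\delta^*(q, \sigma)$.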

3. EO/3- We define A = (E, Q U |g(|, g{}, q^, S', C U {({gi}, 0)}), where for all 
a G E S'{q'Q,a) = {{go,g'i}}, S'{q[,a) = {{g{}} and S'{q,a) = S{q,a) for 
q G Q. The construction for EO is similar. 

4. AO/3- We define A = (17, QU |g(,}, q^. S', C), where for all cr G if S'{qQ,a) = 

{{90, {90}, {go}} and S'{q, a) = S{q, a) for q G Q. 

5. $\exists x\beta$. We define $A = (\Sigma, Q, q_0, \delta', C)$, where for all $\sigma \in \Sigma$ and all $q \in Q$,
$\delta'(q, \sigma) = \delta(q, \sigma \setminus \{x\}) \cup \delta(q, \sigma \cup \{x\})$.
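The construction for the quantifier is the simplest of the list and can be written out directly: the new transition function forgets whether $x$ was in the letter that was read. The sketch below is an illustration under assumed encodings (letters are frozensets of variables, and the transition function is a dictionary from (state, letter) to a set of frozensets of states); `project_variable` is a name introduced here.

```python
def project_variable(delta, states, alphabet, x):
    """Build delta' with delta'(q, s) = delta(q, s - {x}) | delta(q, s | {x}).

    Missing entries of `delta` are read as the empty set of successor sets.
    """
    projected = {}
    for q in states:
        for sigma in alphabet:
            sigma = frozenset(sigma)
            without_x = sigma - {x}
            with_x = sigma | {x}
            projected[(q, sigma)] = (
                delta.get((q, without_x), set()) | delta.get((q, with_x), set())
            )
    return projected
```

After projection the automaton behaves identically on a letter with and without $x$, which matches the intended reading that some valuation of $x$ makes the run succeed.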

The complementation procedure and proofs of soundness for the above
constructions will be given in [5]. To prove the decidability of the satisfiability
problem for QCTL* in the amorphous semantics we have to show that we can decide
whether or not an amorphous automaton $A_\alpha$ accepts the empty language (i.e. $\alpha$
is unsatisfiable). Rather than constructing such a decision process it is enough to
note that $\alpha$ is equivalent to $\chi_{A_\alpha}$. Since $\chi_{A_\alpha}$ is an existentially quantified formula
of QCTL* we can reason that $\chi_{A_\alpha}$ (and hence $\alpha$) is satisfiable if and only if the
unquantified part (i.e. $(at_{q_0} \wedge A\Box\, run \wedge acc)$ from (23)) is satisfiable in CTL*.
Since CTL* is decidable, the decidability of QCTL* follows and we are done.



6 Conclusion 

We have defined three sets of semantics for QCTL*: Kripke semantics, tree 
semantics and amorphous semantics. While the Kripke semantics have been 
shown to be highly undecidable, there is good reason for further investigation. 
We have shown that QCTL* can reason about the structure of a model, but we 
do not yet know to what extent. For example, there is a formula of QCTL that 
is satisfied in the Kripke semantics by exactly the models that are trees. The 
variety of structures that are expressible may have applications in the theory of 
modal logic, or natural language processing. 
In the case of the decidable semantics the complexity of the decision processes
requires further investigation. While there are some well known results in the case
of the tree semantics, the amorphous semantics are relatively new and such issues
are yet to be examined. The definition of amorphous automata is also of interest
in its own right. Its expressive power, complexity and applications are all yet
to be fully explored.



References 

1. M. Abadi and L. Lamport. The existence of refinement mappings. Theoretical
Computer Science, 82(2):253-284, May 1991.

2. E. Clarke and E. Emerson. Synthesis of synchronization skeletons for branching
time temporal logic. In Proc. IBM Workshop on Logic of Programs, Yorktown
Heights, NY, pages 52-71. Springer, Berlin, 1981.

3. E. Emerson. Alternative semantics for temporal logics. TCS, 26, 1983.

4. E. Emerson and A. Sistla. Deciding full branching time logic. Information and
Control, 61:175-201, 1984.

5. T. French. The theory of branching time logics with quantified propositions. PhD
thesis, Murdoch University, in preparation.

6. Y. Gurevich and S. Shelah. The decision problem for branching time logic. J. of
Symbolic Logic, 50:668-681, 1985.

7. D. Harel. Effective transformations on infinite trees, with applications to high
undecidability, dominoes, and fairness. J. A.C.M., 33(1):224-248, 1986.

8. K. Engelhardt, R. van der Meyden, and Y. Moses. Knowledge and the logic of local
propositions. In Conf. on Theoretical Aspects of Rationality and Knowledge, 1998.

9. Yonit Kesten and Amir Pnueli. A complete proof system for QPTL. In Proceedings,
Tenth Annual IEEE Symposium on Logic in Computer Science, pages 2-12, 1995.

10. N. Klarlund. Progress measures, immediate determinacy, and a subset construction
for tree automata. Annals of Pure and Applied Logic, 69:243-268, 1994.

11. O. Kupferman. Augmenting branching temporal logics with existential
quantification over atomic propositions. In Computer Aided Verification, Proc. 7th Int.
Conference, pages 325-338, Liege, 1995. Springer-Verlag.

12. O. Kupferman and O. Grumberg. Branching time temporal logic and amorphous
tree automata. In Proceedings of the Fourth Conference on Concurrency Theory,
pages 262-277, Hildesheim, 1993. Springer-Verlag.

13. O. Kupferman and A. Pnueli. Once and for all. In Proceedings of the Tenth IEEE
Symposium on Logic in Computer Science, San Diego, 1995.

14. R. Milner. A calculus of communicating systems. Lecture Notes in Computer
Science, 92, 1980.

15. P. Kremer. On the complexity of propositional quantification in intuitionistic logic.
J. of Symbolic Logic, 62(2):529-544, 1997.

16. A. Pnueli. The temporal logic of programs. In Proceedings of the Eighteenth
Symposium on Foundations of Computer Science, pages 46-57, 1977.

17. M. Rabin. Decidability of second-order theories and automata on infinite trees.
Trans. AMS, 141:1-35, 1969.

18. A. P. Sistla. Theoretical Issues in the Design and Verification of Distributed
Systems. PhD thesis, Harvard University, 1983.

19. Albert Visser. Bisimulations, model descriptions and propositional quantifiers.
Manuscript, see http://www.citeseer.nj.nec.com/visser96bisimulation.html.




Improved Techniques for an Iris Recognition 
System with High Performance 



Gyundo Kee^, Yungcheol Byun^, Kwanyong Lee^, and Yillbyung Lee^ 



1 Dept. of Computer Science, Yonsei University, Seoul, Korea
2 Dept. of Computer Software Research Laboratory, ETRI, Daejeon, Korea
3 Dept. of Information and Telecommunication, Korea Cyber University, Seoul, Korea
{kigd,heart,kylee,yblee}@csai.yonsei.ac.kr



Abstract. We describe efficient techniques for a high-performance iris
recognition system from a practical point of view. These techniques cover
every step of an iris recognition system, from image acquisition to the
final pattern matching step. They comprise: a method for evaluating the
quality of an image at the image acquisition step and excluding it from
subsequent processing if it is not appropriate; a bisection-based Hough
transform method applied to the edge components for detecting the center
of the pupil and localizing the iris area in an eye image; an elastic body
model for transforming the localized iris area into a simple coordinate
system; and a compact and efficient feature extraction method based on the
2D multiresolution wavelet transform. By exploiting these techniques, we
can make the system computationally more efficient, more accurate, and
more robust against noise.



1 Introduction 

When controlling access to secure areas or transacting electronically over the
internet, a reliable personal identification infrastructure is required.
Conventional methods of recognizing the identity of a person by using a password or
cards are not altogether reliable. Biometric measurements such as fingerprints,
face, or retinal patterns are common and reliable ways to verify
an individual's identity with a high level of accuracy. They meet the
increased security requirements of our information society better than traditional
identification methods such as passwords or ID cards.

Since each individual has a unique and robust iris pattern, the iris has been
considered a good source of information for identifying individuals among the
various biometric features. The highly randomized appearance of the iris makes
its use as a biometric well recognized. Its suitability as an exceptionally accurate
biometric derives from its extremely data-rich physical structure, stability over
time, and genetic independence: no two eyes are the same [1].

Most work on personal identification and verification by iris patterns was
done in the 1990s, and notable recent studies include those of
[1], [2] and [3].
In this paper we present some effective and efficient techniques for improving
the performance of human identification systems based on iris patterns from
a practical point of view. To improve the performance of such systems,
we give the following techniques: an evaluation method for the quality of
images in the image acquisition stage, which determines whether the given images are
appropriate for the subsequent processing and selects the proper
ones; a bisection-based Hough transform method in the iris localization stage for
detecting the center of the pupil and localizing the iris area in an eye image;
an efficient and robust transformation method called the elastic body model for
converting the localized iris area into a simple image so as to facilitate the
feature extraction process; and a compact and effective feature extraction method
based on the 2D multiresolution wavelet transform. Through various
experiments, we will show that the proposed methods can be used for iris-based
personal identification systems in an efficient way.

The rest of this paper is organized as follows. The following section briefly
reviews related work. Section 3 gives the details of the
methods we propose. Experimental results and analysis are
presented in Section 4, and conclusions are given in Section 5.

2 Review of Past Work 

Some work on human iris recognition can be found in the literature [1]-[4].
We take a brief look at the overall process of some of the representative
systems.

Daugman used a circular edge detector to find the boundaries and
developed a feature extraction process based on information from a set of 2-D
Gabor filters. He generated a 256-byte code by quantizing the local phase angle
according to the outputs of the real and imaginary parts of the filtered image,
and compared a pair of iris representations by computing the percentage of
mismatched bits between them via the XOR operator and by selecting a separation point in
the space of Hamming distance.
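The comparison step described above, XOR followed by counting mismatched bits, can be sketched in a few lines. This is an illustrative reading of that description rather than Daugman's implementation; the function names and the `threshold` parameter (standing for the separation point, whose value is not given here) are assumptions.

```python
def hamming_fraction(code_a: bytes, code_b: bytes) -> float:
    """Fraction of mismatched bits between two equal-length iris codes."""
    if len(code_a) != len(code_b) or len(code_a) == 0:
        raise ValueError("iris codes must be non-empty and of equal length")
    # XOR leaves a 1 exactly where the two codes disagree.
    mismatched = sum(bin(a ^ b).count("1") for a, b in zip(code_a, code_b))
    return mismatched / (8 * len(code_a))


def same_iris(code_a: bytes, code_b: bytes, threshold: float) -> bool:
    """Accept the pair when the mismatch fraction is below the separation point."""
    return hamming_fraction(code_a, code_b) < threshold
```

For two 256-byte codes the fraction ranges from 0.0 for identical codes to 1.0 for bitwise complements.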

In contrast, the Wildes system exploited a gradient-based Hough
transform for localizing the iris area, and made use of a Laplacian pyramid
constructed with four different resolution levels to generate the iris code. It also exploited
normalized-correlation-based goodness-of-match values and Fisher's linear
discriminant for pattern matching. Both iris recognition systems made use
of bandpass image decompositions to obtain multiscale information.

Boles used a knowledge-based edge detector for iris localization, and
implemented a system operating on a set of 1-D signals composed of normalized
iris signatures at a few intermediate resolution levels, obtaining the iris
representation of these signals via the zero-crossings of the dyadic wavelet transform.
It made use of two dissimilarity functions to compare a new pattern with the
reference patterns.

Boles' approach has the advantage of processing 1-D iris signals rather
than the 2-D images used in both [1] and [2]. However, [1] and [2] proposed and
implemented whole systems for personal identification or verification, including
the configuration of the image acquisition device, whereas [3] focused only on the iris
representation and matching algorithm without an image acquisition module.



3 Analysis and Recognition of Iris Image 

3.1 Image Acquisition 

An image of the region surrounding the human eye is obtained at a distance from a CCD
camera without any physical contact with the device. To acquire clearer images
through the CCD camera and minimize the effect of reflected light caused
by the surrounding illumination, we consider two lighting setups:
one uses two halogen lamps located to the right and left of the
camera at a distance, and the other uses two infrared lamps located
around the camera. The size of the image acquired under these circumstances is
320x240.

3.2 Evaluation of Image Quality 

For fully automated systems that recognize iris patterns to identify a person, it
is necessary to minimize the person's intervention in the image acquisition process.
One simple way is to acquire a series of images within a specific interval and
select the best one among them, but this approach must run in
reasonable computational time for real applications.

In this paper, we propose a method for checking the quality of images to
determine whether the given images are appropriate for the subsequent processing
or not and then to select the proper ones among them in real time. Images
judged to be inappropriate are excluded from further processing.

The images excluded from the subsequent processing are the following:
images with a blink (Fig. 1(a)), images in which the pupil is not located
in the middle so that some parts of the iris area are missing (Fig. 1(b)), images
obscured by the eyelids or the shadow of the eyelids (Fig. 1(c)), and images with
severe noise as in Fig. 1(d). Fig. 1 shows examples of images with bad quality;
such images can decrease the recognition rate and the overall system
performance if they are not excluded by the proposed method.

We define some basic cases of images inappropriate for recognition
and then develop straightforward and efficient sub-modules to deal with each case
by considering the pixel distribution and the directional properties of edges only
on regions of interest. The sub-modules are combined in parallel or sequentially
depending on the characteristics of the information they use. Our
approach can easily be extended with further functional modules simply by
adding the corresponding mechanisms.

The eye image is first divided into M x N blocks to obtain the pixel 
distribution of specific areas. The quality evaluation consists of three 
stages, formed by combining three sub-modules sequentially. The first stage detects 
blinks, using the fact that the eyelids are lighter than 
the pupil and the iris. The second stage locates 
the pupil approximately. In normal cases the pupil is darker than 
the other areas, so the darkest block around the 
center of the image is the best candidate for the pupil area. After 
finding the darkest block, we assign a score to the other blocks depending on their 
distance from the center of the image. The third stage extracts the vertical and 
horizontal edge components with the Sobel edge detector and computes the ratio of 
the directional components as a score. By simply applying a threshold to the 
sum of the scores obtained from each stage, we decide whether the given image 
is appropriate. 

Fig. 1. Examples of images with bad quality 
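As an illustration of the second stage, the following minimal Python sketch computes the block means and scores the darkest block by its distance from the image center. The function names, the linear scoring rule and the use of plain nested lists for the image are assumptions made for this example; the text above does not fix them.

```python
def block_means(img, m, n):
    """Mean intensity of each block when img (a list of rows) is split into m x n blocks."""
    h, w = len(img), len(img[0])
    bh, bw = h // m, w // n  # block height and width (remainder pixels are ignored)
    means = []
    for bi in range(m):
        row = []
        for bj in range(n):
            total = 0
            for y in range(bi * bh, (bi + 1) * bh):
                for x in range(bj * bw, (bj + 1) * bw):
                    total += img[y][x]
            row.append(total / (bh * bw))
        means.append(row)
    return means


def pupil_location_score(means):
    """Return the index of the darkest block and a score in [0, 1].

    Assumed scoring rule: 1.0 when the darkest block is the central block,
    falling linearly to 0.0 at the corner blocks.
    """
    m, n = len(means), len(means[0])
    bi, bj = min(((i, j) for i in range(m) for j in range(n)),
                 key=lambda p: means[p[0]][p[1]])
    ci, cj = (m - 1) / 2, (n - 1) / 2
    dist = ((bi - ci) ** 2 + (bj - cj) ** 2) ** 0.5
    max_dist = (ci ** 2 + cj ** 2) ** 0.5
    score = 1.0 - dist / max_dist if max_dist > 0 else 1.0
    return (bi, bj), score
```

In a full pipeline this score would be summed with the blink score and the edge-ratio score before thresholding; those two stages are omitted here.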



3.3 Iris Localization 

Iris localization detects the iris area between the pupil and the sclera in an 
eye image. To find that area exactly, it is important to detect precisely the 
inner boundary (between pupil and iris) and the outer boundary (between iris 
and sclera). First, we need the exact reference point, the center of the 
pupil; we then compute the distance from that point to each boundary as the 
radius. 

We propose a three-step technique for detecting the reference point and 
localizing the iris area in an eye image. In the first step, the Canny edge detector is 
applied to the image to extract edge components, and the connected 
components are labeled. The next step uses a 2D bisection-based Hough transform, 
rather than a 2D gradient-based Hough transform [5], to get the center of the pupil. The 
basic idea of the bisection method is that any chord connecting two points on 
a circle is bisected by a perpendicular line that passes through 
the center of the circle. 

For pairs of points a specific distance apart on the edge components, the 
perpendicular bisectors are formed and the frequency of each of their intersecting 
points is computed. The most frequently intersected point above a threshold indicates the 
existence of a circle in the edge components, and that point can 
be taken as the center of the circle, the reference point. After detecting the 
candidate center, the radius histogram technique is applied to validate the 
existence of a circle and calculate its radius. 
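The voting idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the edge points are taken to be ordered along a closed contour, chords are formed between points `step` positions apart, votes are accumulated on an integer grid, and the names `bisector`, `intersect` and `vote_center` are hypothetical.

```python
from collections import Counter


def bisector(p, q):
    """Perpendicular bisector of chord pq as (a, b, c), meaning a*x + b*y = c."""
    a, b = q[0] - p[0], q[1] - p[1]
    c = (q[0] ** 2 + q[1] ** 2 - p[0] ** 2 - p[1] ** 2) / 2.0
    return a, b, c


def intersect(l1, l2):
    """Intersection of two lines given as (a, b, c), or None if they are parallel."""
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-9:
        return None
    return (c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det


def vote_center(points, step, min_votes):
    """Vote for bisector intersections; return the winning grid cell or None."""
    votes = Counter()
    n = len(points)
    for i in range(n):
        p, q, r = points[i], points[(i + step) % n], points[(i + 2 * step) % n]
        hit = intersect(bisector(p, q), bisector(q, r))
        if hit is not None:
            votes[(round(hit[0]), round(hit[1]))] += 1
    if not votes:
        return None
    center, count = votes.most_common(1)[0]
    # Assumed acceptance rule: the peak must reach a minimum number of votes.
    return center if count >= min_votes else None
```

Because every pair of bisectors of chords on the same circle meets at its center, points on a true circle concentrate their votes in one cell, while noise spreads its votes thinly.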




Fig. 2. Distance from the center to the tentative inner boundary 



To compute the radius of the tentative circle, the inner boundary, one 
simple method is to average all of the distances from the center to the points on 
the connected edge components, but this method is sensitive to noise. Therefore, 
we propose a new method that we call maximal frequency determination. 
The method divides the possible range of the radius into many sub-ranges, 
selects the sub-range with the maximal frequency, and takes the 
median of that sub-range as the radius. This makes the 
radius less sensitive to noise. After determining the radius, we can 
easily find the inner boundary using the center of the pupil and the radius. The 
outer boundary is found by a similar process. 
Finally, the iris area is localized by separating the part of the image between 
the inner boundary and the outer boundary. 
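A minimal sketch of maximal frequency determination follows. It assumes equal-width sub-ranges and reads "the median of the sub-range" as the median of the distances that fall into the winning sub-range; the function name and parameters are illustrative, not taken from the paper.

```python
import math
from statistics import median


def radius_by_max_frequency(center, edge_points, r_min, r_max, n_bins):
    """Estimate a circle radius from the most populated sub-range of distances."""
    width = (r_max - r_min) / n_bins
    bins = [[] for _ in range(n_bins)]
    for x, y in edge_points:
        d = math.hypot(x - center[0], y - center[1])
        if r_min <= d < r_max:
            # Clamp the index to guard against floating-point rounding at r_max.
            idx = min(int((d - r_min) / width), n_bins - 1)
            bins[idx].append(d)
    best = max(bins, key=len)  # sub-range with the maximal frequency
    return median(best) if best else None
```

Outlying edge points, for example from eyelashes, land in other sub-ranges and therefore do not shift the estimate, which is the advantage over a plain average.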

Fig. 3 shows each stage of the bisection-based Hough transform method. 



3.4 Normalization 

A normalization process is implemented to compensate for size variations due to 
possible changes in the camera-to-face distance, and to facilitate the feature 
extraction process by converting the iris area from a polar coordinate 
system into a Cartesian coordinate system. 
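The polar-to-Cartesian conversion can be illustrated with a simple unwrapping routine. This sketch assumes ideal concentric circular boundaries and nearest-neighbour sampling, so it shows only the scale compensation; it is not the elastic body model proposed in this paper, and the function name and parameters are hypothetical.

```python
import math


def unwrap_iris(img, center, r_in, r_out, n_radial, n_angular):
    """Sample the ring between r_in and r_out into an n_radial x n_angular array.

    img is a list of rows; center is (cx, cy) in pixel coordinates.
    Requires n_radial >= 2.
    """
    cx, cy = center
    out = []
    for i in range(n_radial):
        # Radii are spaced evenly from the inner to the outer boundary.
        r = r_in + (r_out - r_in) * i / (n_radial - 1)
        row = []
        for j in range(n_angular):
            theta = 2.0 * math.pi * j / n_angular
            x = int(round(cx + r * math.cos(theta)))
            y = int(round(cy + r * math.sin(theta)))
            row.append(img[y][x])
        out.append(row)
    return out
```

Whatever the size of the iris in the source image, the output has the same fixed dimensions, which is what makes feature vectors from different acquisitions comparable.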

As shown in Fig. 2, the deviation of the radius is about 14%, which means 
that information may be lost in the localized iris area through the 
inclusion of part of the pupil area and the exclusion of part of the iris area. It is 
therefore worthwhile to address transformation compensation in addition to scale compensation. 

To solve such problems, which are caused by using a virtual circle for the 
boundaries, we propose a new normalization method, the elastic body model, for scale, 
rotation, and transformation compensation. 
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Fig. 3. Each stage of the bisection-based Hough transform: (a)Original image (b)Edge 
detected image (c)Plot of the frequency of the intersecting points (d)Radius histogram 
(e)Detected inner boundary (f)Detected outer boundary 



Elastic Body Model. The model assumes a one-to-one 
mapping between the inner boundary and the normalized mapping area, despite 
the actual distortions of the circular shape. Fig. 4 shows the conceptual 
diagram of the proposed method. We only consider the vertical direction of the 
inner boundary as the axis direction of an elastic body. The outer boundary 
corresponds to the outer frame of the elastic body, and the inner boundary 
corresponds to the freely moving ends of independent springs. Each point of 
the inner boundary is connected to the corresponding point of the outer boundary 
by a spring. 

We make two assumptions about the movement of the iris muscles in the model. One 
is that the iris muscles consist of elastic bodies connected by pin joints to 
the outer frame, and the other is that the elastic bodies (iris patterns) can be 
deformed only in the direction of each spring, not in the perpendicular direction, 
which means that the springs are not permitted to bend. 

The algorithm to apply the model to iris images is briefly described in Fig. 5 
and Fig. 6. 



In the Fig. 6, is given by 
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Fig. 4. Proposed normalization method: Elastic Body Model 




Fig. 5. Relationship between inner boundary and outer boundary from the elastic 
body’s viewpoint 



3.5 Feature Extraction 

Various wavelets described in the literature have been reported to possess 
different properties of orthogonality, symmetry and compact support. The wavelet 
paradigm is now well established and has found many applications in signal and 
image processing [7]. We selected the Daubechies wavelet (tap-4) because it has 
shown high texture classification performance. This wavelet is based on orthogonalization and 
factorization conditions; it is not symmetric but provides compact support [8]. 

Multiresolution techniques transform images into a representation 
in which both spatial and frequency information is present [6]. In our scheme, 
all the iris images are first decomposed into subbands using the wavelet transform. 
With the pyramid-structured wavelet transform, the original image is passed 
through low-pass and high-pass filters to generate the low-low, low-high, 
high-low and high-high subimages. The decomposition is applied recursively to 
the low-frequency channel to obtain lower-resolution subimages. Thus, we 
can analyze an iris image at both local and global scales simultaneously using 
the multiresolution approach. 
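The pyramid-structured decomposition can be sketched in pure Python with the standard four-tap Daubechies filter coefficients. The periodic boundary extension, the subband naming (first letter for the row filter, second for the column filter) and the function names are assumptions of this example rather than details given in the paper.

```python
import math

S3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)
# Four-tap Daubechies low-pass filter and the matching high-pass filter.
LOW = [(1 + S3) / NORM, (3 + S3) / NORM, (3 - S3) / NORM, (1 - S3) / NORM]
HIGH = [LOW[3], -LOW[2], LOW[1], -LOW[0]]


def analyze(signal, taps):
    """Filter a 1D signal and keep every second sample (periodic extension)."""
    n = len(signal)
    return [sum(taps[k] * signal[(2 * i + k) % n] for k in range(4))
            for i in range(n // 2)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dwt2(image):
    """One decomposition level: returns the LL, LH, HL and HH subimages."""
    lo_rows = [analyze(r, LOW) for r in image]
    hi_rows = [analyze(r, HIGH) for r in image]

    def cols(m, taps):
        return transpose([analyze(c, taps) for c in transpose(m)])

    return {"LL": cols(lo_rows, LOW), "LH": cols(lo_rows, HIGH),
            "HL": cols(hi_rows, LOW), "HH": cols(hi_rows, HIGH)}


def pyramid(image, levels):
    """Recursively decompose the low-low channel; one dict of subimages per level."""
    out, current = [], image
    for _ in range(levels):
        bands = dwt2(current)
        out.append(bands)
        current = bands["LL"]
    return out
```

Calling `pyramid(image, 3)` on an image whose sides are divisible by 8 yields three levels of four subimages each, each level half the size of the previous one.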

Mallat’s experiment [6] suggests that, with a wavelet representation, 
statistics based on the first-order distribution of grey levels might be sufficient for 
preattentive perception of textural difference. Hence we use four features, i.e. mean, 
variance, standard deviation, and energy, from the gray-level histogram of the 
subband images. 

Fig. 6. The normalization process of elastic body model 
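The four statistics can be computed directly from the coefficients of a subband, as in the short sketch below. The paper does not define energy explicitly; the mean of the squared coefficients is assumed here, and the population form of the variance is also an assumption.

```python
def subband_features(sub):
    """Return (mean, variance, standard deviation, energy) of a subband.

    sub is a list of rows of coefficients. Energy is assumed to be the
    mean of the squared coefficients; variance is the population variance.
    """
    vals = [v for row in sub for v in row]
    n = len(vals)
    mean = sum(vals) / n
    var = sum((v - mean) ** 2 for v in vals) / n
    std = var ** 0.5
    energy = sum(v * v for v in vals) / n
    return mean, var, std, energy
```

Under these definitions the energy equals the variance plus the squared mean, so the four values are not independent; a compact feature vector may therefore keep only some of them.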

In this paper, each iris image was decomposed into three levels using the 
Daubechies tap-4 filter, which resulted in 12 subimages from which to extract iris 
features. We used statistical features to form the feature vectors; four 
statistical features were computed from each subband image. In addition, 
we divided the subimages into local windows in order to obtain feature sets that are robust 
against shift, translation and noise (Fig. 7). We extracted statistical 
features from local windows on the subimages 
of the intermediate levels to form the feature vectors. 



3.6 Pattern Matching 

The process of pattern matching consists of two phases: training and 
classification. In the training phase, we construct the registered patterns, corresponding 
to the enrollment process of an iris recognition system, from a set of features 
obtained from the wavelet transformation of images. In the classification phase, 
the feature representation of an unknown iris is constructed and compared 
with the registered ones to identify a person. 

Denote the registered patterns by $y_i = (y_{i,1}, \ldots, y_{i,J})$, $i = 1, \ldots, M$, and an 
unknown image pattern by $x = (x_1, \ldots, x_J)$. Then we calculate the distance 
between the two patterns defined by the discrimination function in the feature 
space. The discrimination function for iris texture is given by 

$$D_i = \mathrm{distance}(x, y_i). \qquad (2)$$

Fig. 7. Arrangement of feature vector by local windows 

Several distance functions can be used in Eq. (2). We consider, first of all, the 
Euclidean distance function defined as 

$$D_{1,i} = \left[ (x - y_i)^T (x - y_i) \right]^{1/2} = \left[ \sum_{j=1}^{J} (x_j - y_{i,j})^2 \right]^{1/2}. \qquad (3)$$

The Euclidean distance between two features, however, can be greatly 
influenced by the variables that have the largest values [9]. A more 
robust alternative in the presence of outliers is to divide the values by the 
standard deviation, which reduces the effect of extreme values. 
Registered patterns are normalized by centering each component around 
its mean $m_j$ and then scaling it by the inverse of its standard deviation $\sigma_j$, 

$$y'_{i,j} = \frac{y_{i,j} - m_j}{\sigma_j}. \qquad (4)$$

An unknown pattern is centered and scaled by 

$$x'_j = \frac{x_j - m_j}{\sigma_j}. \qquad (5)$$

The mean $m_j$ for each component is computed by $m_j = \sum_{i=1}^{M} y_{i,j} / M$ and the 
variance is obtained via the unbiased estimator $\sigma_j^2 = \sum_{i=1}^{M} (y_{i,j} - m_j)^2 / (M - 1)$. 

Therefore, the distance $D_{2,i}$ on the normalized features can be expressed as 

$$D_{2,i} = \left[ \sum_{j=1}^{J} \left( \frac{x_j - y_{i,j}}{\sigma_j} \right)^2 \right]^{1/2}. \qquad (6)$$

To determine which iris class provides the best representation of an input 
image, we select the class with the minimum distance among all of the results for 
the registered vectors by Eq. (3) and (6). Thus an unknown iris is matched 
with a specific registered sample if the dissimilarity between that 
sample and the unknown one is the smallest in comparison 
with the other samples. 
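The minimum-distance rule can be sketched as follows. The function names are illustrative; the standard deviation uses the unbiased estimator mentioned above, and the example assumes that no component has zero variance.

```python
def fit_scaling(registered):
    """Per-component mean and unbiased standard deviation of the registered patterns."""
    m = len(registered)
    dims = len(registered[0])
    means = [sum(p[j] for p in registered) / m for j in range(dims)]
    stds = [(sum((p[j] - means[j]) ** 2 for p in registered) / (m - 1)) ** 0.5
            for j in range(dims)]
    return means, stds


def distance(x, y, stds=None):
    """Euclidean distance when stds is None, otherwise the normalized distance.

    The means cancel in the difference of two normalized patterns,
    so only the standard deviations are needed.
    """
    if stds is None:
        stds = [1.0] * len(x)
    return sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, stds)) ** 0.5


def classify(x, registered, stds=None):
    """Index of the registered pattern with the minimum distance to x."""
    return min(range(len(registered)),
               key=lambda i: distance(x, registered[i], stds))
```

With registered patterns `[[1, 100], [2, 300], [3, 200]]` and an unknown pattern `[1, 210]`, the plain Euclidean distance is dominated by the large second component, whereas the normalized distance weighs both components on a comparable scale and selects a different pattern.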



4 Experimental Results 

We use two iris data sets acquired in different environments for real 
applications. The first data set was obtained with two halogen lamps as the 
surrounding illumination under irregular indoor lights, and is composed of 4500 iris 
images acquired from 150 persons (Data set 1). The second was acquired under 
constant illumination with two infrared lamps, and consists of 600 iris images from 
20 persons (Data set 2). We used four samples from each person as training data, 
and the remaining ones as test data. 

For the experiment on evaluating the image quality, the processing time on 
a Pentium-III 450 MHz PC running Windows 2000 is about 0.2 second. Table 1 
shows the processing time at each stage; our method consists 
of three stages: blink detection (F1), detection of the pupil 
location (F2), and computation of the edge component ratio (F3). 



Table 1. Processing time for evaluating the image quality 



Step                    F1      F1+F2   F1+F2+F3 
Processing Time (sec)   0.04    0.8     0.16 



For the iris localization stage, we first used the images that had passed 
the quality evaluation. When we applied the preprocessing 
techniques, namely the bisection-based Hough transform and the elastic body model, 
to these images, the success rate of the preprocessing stage was 97.3%, judged 
subjectively. The major causes of failure in iris 
localization include distortion of the inner boundary by intruding eyelashes, 
and inconsistency between the original image and the extracted outer 
boundary caused by noise. In contrast, all of the eye images that were judged 
inappropriate by the quality evaluation stage gave imperfect results, 
even though they appeared to finish the localization process successfully. From 
these results, we conclude that the reliability and performance of the system were 
improved by the image quality evaluation module. 

Fig. 8 illustrates the direction of all the vertical lines after compensating by 
the elastic body model. We noticed that the distorted edge components caused 
by reflected light made a convex shape by compensation. 
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(a) Original Image (b) Before compensation (c) After compensation 



Fig. 8. Direction of vertical lines after compensating by elastic body model 



From a practical viewpoint, such systems need 
a compact feature dimension as well as high accuracy and reliability. 
To meet these requirements, we conducted four experiments to select the 
best feature selection strategy. The first arranges the statistical 
information of all the subimages into a feature vector. The second 
builds a feature vector by combining the decomposition coefficients 
of the low-level subimages. The third builds a feature vector by 
combining the statistical information for some of the low-level subimages with 
the decomposition coefficients of the low-level subimages. The fourth 
combines the statistical information obtained from the low-level subimages by 
applying local windows. 




Fig. 9. Comparison of results using different feature vectors and distance function 



Fig. 9 shows the recognition rates for the four experiments described above. 
As the figure shows, we obtained the best recognition rate when using a 
feature vector that combines the mean and standard deviation values of local windows 
for the intermediate subbands. 
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5 Conclusions 

Through the experiments with the proposed method for evaluating image 
quality, we confirmed that it can check the quality of images and determine their 
appropriateness in real time, and that it improves performance by excluding 
unnecessary and improper images from the subsequent processing. 

The bisection-based Hough transform for detecting the centre of the circle 
and extracting the radius of the detected circular shape from iris images is 
more robust to noise than the existing methods while being more accurate. 
Furthermore, the proposed elastic body model allows compensation for the 
transformation variations of iris shape resulting from asymmetric constriction 
and dilation, and for rotation, which is a deviation in angular position about 
the optical axis. 

By selecting some intermediate resolution levels of the multiresolution 
wavelet approach to obtain iris features, we get more compact features that are 
robust in a noisy environment, and reduce the computation time, as the 
lower frequency bands are subsampled successively without loss of information. 
In addition, by normalizing the statistical values acquired from local windows of 
subimages, we achieve robustness against mismatches due to shift and noise. 

We showed that the proposed methods can be easily applied to the real 
problems of iris-based identification system in an efficient manner. 
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Abstract. Other work has shown that adaptive learning can be highly 
successful in developing programs which are able to play games at a level 
similar to human players and, in some cases, exceed the ability of a vast 
majority of human players. This study uses poker to investigate how adaptation 
can be used in games of imperfect information. An internal learning value is 
manipulated which allows a poker playing agent to develop its playing strategy 
over time. The results suggest that the agent is able to learn how to play poker, 
initially losing, before winning as the player’s strategy becomes more 
developed. The evolved player performs well against opponents with different 
playing styles. Some limitations of previous work are overcome, such as deal 
rotation to remove the bias introduced by one player always being the last to 
act. This work provides encouragement that this is an area worth exploring 
more fully in our future work. 



1. Introduction 

Game playing has a long research history. Chess has received particular interest 
culminating in Deep Blue beating Kasparov in 1997, albeit with specialized hardware 
(Hamilton, 1997) and brute force search. However, although it is arguably a ‘solved 
game’, chess still receives interest as researchers turn to adaptive learning techniques 
which allow computers to ‘learn’ to play chess, rather than being ‘told’ how it should 
play (Kendall, 2001). Adaptive learning was being used for checkers as far back as 
the 1950’s with Samuel’s seminal work (1959, re-produced in Samuel, 2000). 
Checkers research would lead to Jonathan Schaeffer developing Chinook, which 
claimed the world title in 1994 (Schaeffer, 1996). Like Deep Blue, it is arguable if 
Chinook used AI techniques. Chinook had an opening and ending database. In certain 
games it was able to play the entire game from these two databases. If this could not 
be achieved, a form of mini-max search, with alpha-beta pruning was used. Despite 
Chinook becoming the world champion, the search has continued for an adaptive 
checkers player. Chellapilla and Fogel’s (Chellapilla, 2000) Anaconda was so named 
because of the stranglehold it placed on its opponents. It is also named Blondie24, this 
being the name it used when competing in internet games (Fogel, 2001). Anaconda 
uses an artificial neural network (ANN), with 5000 weights, which are evolved by an 
evolutionary strategy. The inputs to the ANN are the current board position and it 
outputs a value which is used in a mini-max search. During the training period, using 
co-evolution, the program is given no information other than whether it won or lost. 



M. Brooks, D. Corbett, and M. Stumptner (Eds.): AI 2001, LNAI 2256, pp. 189-200, 2001. 
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Once Anaconda is able to play at a suitable level, it often searches to a depth of 10, 
but depths of 6 and 8 are also common in play. Anaconda has been available to the 
delegates at the Congress on Evolutionary Computation (CEC) conference for the past 
two years (CEC’00, San Diego and CEC’01, Seoul) with Fogel offering a prize of 
$100 (CEC’00) and $200 (CEC’01) to anybody who could defeat it. The prize 
remains unclaimed and at the next conference (CEC’02, Hawaii), the prize rises to 
$300. 

Poker has an equally long research history, with von Neumann and Morgenstern 
(von Neumann, 1944) experimenting with a simplified, two-player version of poker. 
Findler (Findler, 1977) studied poker over a 20-year period. He also worked on a 
simplified game, based on 5-card draw poker with no ante and no consideration of 
betting position due to the computer always playing last. He concluded that dynamic 
and adaptive algorithms are required for successful play and that static mathematical 
models were unsuccessful and easily beaten. 

In more recent times three research groups have been researching poker. Jonathan 
Schaeffer (of Chinook fame) and a number of his students have developed ideas 
which have led to Loki, which is, arguably, the strongest poker playing program to 
date. It is still a long way from being able to compete in the World Series of Poker 
(WSOP), an annual event held in Las Vegas, but initial results are promising. 
Schaeffer’s work concentrates on two main areas (Billings, 1998a and Schaeffer, 
1999). The first research theme makes betting decisions using probabilistic 
knowledge (Billings, 1999) to determine which action to take (fold, call or raise) 
given the current game state. Billings et al. also use real-time simulation of the 
remainder of the game, which allows the program to determine a statistically significant 
result in its decision making process. Schaeffer’s group also uses 
opponent modeling (Billings, 1998b). This allows Loki to maintain a model of an 
opponent and use this information to decide what betting decisions to make. 

Koller and Pfeffer (Koller, 1997), using their Gala system, allow games of 
imperfect information to be specified and solved, using a tree based approach. 
However, due to the size of the trees they state “. . . we are nowhere close to being able 
to solve huge games such as full-scale poker, and it is unlikely that we will ever be 
able to do so.” 

Luigi Barone and Lyndon While recognise four main characteristics of poker players: Loose, 
Tight, Passive, and Aggressive. These characteristics are combined to create the four 
common types of poker players: Loose Passive, Loose Aggressive, Tight Passive and 
Tight Aggressive players (Barone & While, 1999; 2000). A Loose Aggressive player 
will overestimate their hand, raising frequently, and their aggressive nature will drive 
the pot higher, increasing their potential winnings. A Loose Passive player will 
overestimate their hand, but due to their passive nature will rarely raise, preferring to 
call and allow other players to increase the pot. A Tight Aggressive player will play 
to close constraints, participating in only a few hands which they have a high 
probability of winning. The hands they do play, they will raise frequently to increase 
the size of the pot. A Tight Passive player will participate in few hands, only 
considering playing those that they have a high probability of winning. The passive 
nature implies that they allow other players to drive the pot, raising infrequently 
themselves. 
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In their first paper Barone and While (Barone, 1998) suggest evolutionary 
strategies as a way of modelling an adaptive poker player. They use a simple poker 
variant where each player has two private cards, there are five community cards and 
one round of betting. This initial work incorporates three main areas of analysis; hand 
strength, position and risk management. Two types of tables are used, a loose table 
and a tight table. The work demonstrates how a player that has evolved using 
evolutionary strategies can adapt its style to the two types of table. 

In (Barone, 1999) they develop their work by introducing a hypercube which is an 
n dimensional vector, used to store candidate solutions. The hypercube has one 
dimension for the betting position (early, middle and late) and another dimension for 
the risk management (selected from the interval 0..3). At each stage of the game the 
relevant candidate solutions are selected from the hypercube (e.g. middle betting 
position and risk management 2) and the decision is made whether to fold, call or 
raise. To make the decision the hypercube entry holds seven real valued numbers 
which are used as constants to three functions (fold, call and raise). In effect, the 
functions lead to a probability of carrying out the relevant action. It is the seven real 
values that are evolved depending on whether the player won the hand or not. Barone 
reports that this poker player improves on the 1998 version. Their 2000 paper 
(Barone, 2000) extends the dimensions of the hypercube to include four betting 
rounds (pre-flop, post-flop, post-turn and post-river) and an opponent dimension so 
that the evolved player can choose which type of player it is up against. The authors 
report that this player outperforms a competent static player. 

Poker, being a game of imperfect information, is interesting as a game for the basis 
of research. Unlike chess and checkers, poker has some information that is unseen. 
Poker also contains other unknowns such as the playing styles of the other players 
who may use bluffing (and double bluffing) during the course of the game. These 
elements add to the research interest. Unlike complete information games where the 
techniques to solve the games (computational power allowing) have been known and 
understood for a long time (such as mini-max search and alpha-beta pruning), games 
of imperfect information have not received the same sort of analysis and, doing so, 
could prove relevant to many other areas such as economics, on-line auctions and 
negotiating. 



2. The Rules of Poker 

The exact rules for poker can be found in many poker books (see, for example, 
Sklansky, 1994; 1996) and we simply give here the basic rules of one variant (Texas 
Hold ‘Em) so that the reader is able to follow the remainder of this paper. Each player 
is dealt two cards. These are private cards, only being visible to the player receiving 
those cards. These cards are normally referred to as hole cards. A round of betting 
follows this initial deal. Next, three community cards (called the flop) are dealt, face 
up, in the middle of the table. These cards are used by every player to make the best 
five card poker hand, using their hole cards. A round of betting follows the flop. Next, 
another community card (called the turn) is dealt face up in the middle of the table. 
Another round of betting follows. Finally, another community card (called the river) 




192 G. Kendall and M. Willdig 



is dealt and a final round of betting follows. Once this final round of betting has taken 
place, assuming there are two or more players who still have an interest in the pot, the 
cards are shown and the highest poker hand wins. In forming a poker hand, the 
players can use any combination of their two hole cards and the five community cards 
to make the best five card poker hand. The various poker hands are as follows, in 
descending order. 

Royal Flush: Ten, Jack, Queen, King and Ace, all in the same suit. 

Straight Flush: any sequence of five cards, all of the same suit. 

Four of a Kind: four cards having the same value, one from each suit. 

Full House: three cards of the same value combined with two cards of the same 
value. For example, three 2’s and a pair of Queens. 

Flush: all five cards have the same suit. 

Straight: all five card values are in sequence, made up from at least two suits. 

Three of a Kind: three cards all having the same value. 

Two Pairs: two cards of the same value, combined with another two cards of the same 
value. For example, two 9’s and two 3’s. 

A Pair: two cards having the same value. 

Single Card: the highest value card is used to value the hand. 
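The ranking above can be expressed as a short classification routine. This is an illustrative sketch only: cards are assumed to be `(rank, suit)` tuples with ranks 2 to 14 (Ace = 14), the Ace is treated as high only, and the function name is hypothetical.

```python
from collections import Counter


def classify_hand(cards):
    """Name the category of a five-card hand given as (rank, suit) tuples."""
    ranks = sorted(r for r, _ in cards)
    suits = {s for _, s in cards}
    # Multiplicities of equal ranks, largest first, e.g. [3, 2] for a full house.
    counts = sorted(Counter(ranks).values(), reverse=True)
    flush = len(suits) == 1
    straight = len(set(ranks)) == 5 and ranks[4] - ranks[0] == 4
    if flush and straight:
        return "Royal Flush" if ranks[0] == 10 else "Straight Flush"
    if counts[0] == 4:
        return "Four of a Kind"
    if counts == [3, 2]:
        return "Full House"
    if flush:
        return "Flush"
    if straight:
        return "Straight"
    if counts[0] == 3:
        return "Three of a Kind"
    if counts == [2, 2, 1]:
        return "Two Pairs"
    if counts[0] == 2:
        return "A Pair"
    return "Single Card"
```

The tests are ordered from the strongest category downwards, so a hand that satisfies several definitions (for example a straight flush, which is also a flush and a straight) is reported under its highest category.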

When betting, the players have three choices to make. They can either fold (throw 
in their cards and relinquish all claims to the money in the pot), they can call (match 
the amount of money bet so far) or they can raise (increase the current bet, thus 
forcing all the other players to match this amount or fold). To start the betting it is 
usual to put in some form of ante. This is a mechanism to start the betting by giving 
the players an interest in the pot. 

In this paper we have not implemented a full version of Texas Hold ‘Em, 
preferring a version of poker, where the players are dealt five cards and, after a round 
of betting, are allowed to trade two cards before a final round of betting. This version 
is known as draw poker and was considered as a suitable test bed for this initial 
investigation. 



3. Experiments 

We have implemented the four playing styles (loose passive, loose aggressive, tight 
passive and tight aggressive) described above so that we can sit each of them at our 
tables and find out if our approach can adapt to each of these styles. Each playing 
style will play to a specific set of rules using the value of their current hand and the 
current value of the pot to decide whether to fold, call, or raise. 
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Table 1: The Loose Aggressive Players Strategy 



1st Round Strategy 

Hand From               Hand To                 Action 
0                       Pair 8’s                Fold 
Pair 9’s                Pair K’s                Call 
Pair A’s                2 Pairs Ace High        Raise 5 if Pot <= 100 otherwise Call 
Three 2’s               Three 4’s               Raise 10 if Pot <= 150 otherwise Call 
Three 5’s               Three J’s               Raise 15 if Pot <= 200 otherwise Call 
Three Q’s               Three A’s               Raise 20 if Pot <= 250 otherwise Call 
Straight                Royal Flush             Raise 25 if Pot <= 300 otherwise Call 

2nd Round Strategy 

Hand From               Hand To                 Action 
0                       Pair 8’s                Fold 
Pair 9’s                Three 6’s               Call 
Three 7’s               Three A’s               Raise 5 if Pot <= 150 otherwise Call 
Straight 6 High         Straight A High         Raise 10 if Pot <= 200 otherwise Call 
Flush 6 High            Full House A High       Raise 15 if Pot <= 250 otherwise Call 
Four 2’s                Four A’s                Raise 20 if Pot <= 300 otherwise Call 
Straight Flush 6 High   Royal Flush             Raise 25 if Pot <= 400 otherwise Call 



Table 2: The Loose Passive Players Strategy 



1st Round Strategy 

Hand From               Hand To                 Action 
0                       Pair 8’s                Fold 
Pair 9’s                Three J’s               Call 
Three Q’s               Flush A High            Raise 5 if Pot <= 100 otherwise Call 
Full House 2 High       Royal Flush             Raise 10 if Pot <= 150 otherwise Call 

2nd Round Strategy 

Hand From               Hand To                 Action 
0                       Pair 8’s                Fold 
Pair 9’s                Three A’s               Call 
Straight 6 High         Straight A High         Raise 5 if Pot <= 100 otherwise Call 
Flush 6 High            Four 5’s                Raise 10 if Pot <= 150 otherwise Call 
Four 6’s                Royal Flush             Raise 15 if Pot <= 200 otherwise Call 



Table 3: The Tight Aggressive Players Strategy 



1st Round Strategy 

Hand From               Hand To                 Action 
0                       Pair A’s                Fold 
2 Pairs 3 High          Three 4’s               Call 
Three 5’s               Three J’s               Raise 5 if Pot <= 150 otherwise Call 
Three Q’s               Three A’s               Raise 15 if Pot <= 200 otherwise Call 
Straight 6 High         Royal Flush             Raise 25 if Pot <= 300 otherwise Call 

2nd Round Strategy 

Hand From               Hand To                 Action 
0                       Pair A’s                Fold 
2 Pairs 3 High          Three 10’s              Call 
Three J’s               Three A’s               Raise 5 if Pot <= 150 otherwise Call 
Straight 6 High         Straight A High         Raise 10 if Pot <= 200 otherwise Call 
Flush 6 High            Full House A High       Raise 15 if Pot <= 250 otherwise Call 
Four 2’s                Four A’s                Raise 20 if Pot <= 300 otherwise Call 
Straight Flush 6 High   Royal Flush             Raise 25 if Pot <= 400 otherwise Call 



Table 4: The Tight Passive Players Strategy 



1st Round Strategy 

Hand From               Hand To                 Action 
0                       Pair A’s                Fold 
2 Pairs 3 High          Straight A High         Call 
Flush 6 High            Four 5’s                Raise 5 if Pot <= 100 otherwise Call 
Four 6’s                Straight Flush A High   Raise 10 if Pot <= 150 otherwise Call 
Royal Flush             Royal Flush             Raise 15 if Pot <= 200 otherwise Call 

2nd Round Strategy 

Hand From               Hand To                 Action 
0                       2 Pairs A High          Fold 
Three 2’s               Three A’s               Call 
Straight 6 High         Flush A High            Raise 5 if Pot <= 100 otherwise Call 
Full House 2 High       Four A’s                Raise 10 if Pot <= 150 otherwise Call 
Straight Flush 6 High   Royal Flush             Raise 15 if Pot <= 250 otherwise Call 



In order to test our adaptive poker player, we adopt the following rules. At the start
of each hand each player places an ante of one unit into the pot. There are two
rounds of betting. Each round passes around the table a maximum of three times,
unless every player except one decides to fold, or all players call.

Non-evolving players play to the strategies described above (tables 1 to 4).
The evolving player considers three factors when deciding whether to fold, call or
raise: hand strength, the number of players left at the table and the money
in the pot. As well as these factors, a learning value is evolved and also
dictates the actions of the evolving player. The learning value is adjusted
throughout the training period of the evolving player, assisting in its decision whether
to fold, call or raise. There is a learning value (ranging over the interval 1..10)
associated with each possible hand. The algorithm for deciding whether to fold, call or
raise is as follows.

If lv < 6 then FOLD
elseif lv >= 6 AND lv < 8 then CALL
elseif lv >= 8 then
    ac = (lv / LOG(pv)) / (np / lv)
    if ac < 10 then CALL
    else RAISE by SQRT(ac) * w

where

lv = learning value for the hand being played
pv = the current value of the pot
np = number of players left in the current game
ac = player's action, returning a value greater than 0
w  = a weighting factor dependent on lv

if lv < 8 then w = 1
if 8 <= lv <= 8.99 then w = 3
if 9 <= lv <= 9.99 then w = 4
if lv > 9.99 then w = 5

Examples of the use of this algorithm are shown in figures 1 to 3.

lv = 8, np = 5, pv = 50
ac = 7.53, player will call

Fig. 1: Player Calls

lv = 8, np = 3, pv = 50
ac = 12.57, player will raise 11 units

Fig. 2: Player Raises 11 units

lv = 9, np = 3, pv = 50
ac = 15.89, player will raise 16 units

Fig. 3: Player Raises 16 units

Figure 1 has more players participating in the current game, so the
evolving player calculates that it is less likely to win the pot and decides to call. A
reduced number of players contesting the pot increases the evolving player's chances
of winning, influencing its action to raise, as shown in figure 2.

Figure 3 highlights the difference in the raise value when the learning value is
adjusted between the values of 8 (figure 2) and 9 (figure 3). The learning value of 9 is
associated with better hand rankings, thus there is a better chance of winning. As the
possibility of winning is increased with the higher learning value, more emphasis is
placed on driving the pot harder, raising it, in the hope of increased winnings.
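
The decision rule and the three worked figures can be reproduced with a short sketch. This is not the authors' code: the function names, the use of a base-10 logarithm and the rounding of a raise to the nearest whole unit are assumptions, chosen because they reproduce the values in figures 1 to 3 to within rounding.

```python
import math


def weight(lv):
    """Weighting factor w, dependent on the learning value lv."""
    if lv > 9.99:
        return 5
    if lv >= 9:
        return 4
    if lv >= 8:
        return 3
    return 1


def action_value(lv, pv, np_left):
    """ac = (lv / LOG(pv)) / (np / lv); the base-10 log is an assumption."""
    return (lv / math.log10(pv)) / (np_left / lv)


def decide(lv, pv, np_left):
    """Return (action, raise_amount) for one betting decision."""
    if lv < 6:
        return ("fold", 0)
    if lv < 8:
        return ("call", 0)
    ac = action_value(lv, pv, np_left)
    if ac < 10:
        return ("call", 0)
    # Rounding the raise to whole units is an assumption.
    return ("raise", round(math.sqrt(ac) * weight(lv)))


print(decide(8, 50, 5))  # figure 1 inputs
print(decide(8, 50, 3))  # figure 2 inputs
print(decide(9, 50, 3))  # figure 3 inputs
```

With the inputs of figures 1 to 3 the sketch returns a call, a raise of 11 units and a raise of 16 units respectively.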

The learning values, lv, are associated with hand strength. Each hand is given a
value, lv, which is used in the formulae outlined above. Initially, all values are set to
10, so that the evolving player will raise every time. Using this method every hand is
assumed to be good until we find out otherwise. This is seen as preferable to
assuming every hand is bad until we know otherwise, as this was one of the criticisms
that Barone made of his own work. His program experienced a royal flush so infrequently that
it folded one when it did appear, on the basis that it had not learnt that this
was a good hand.

Our adaptation technique is simple. If the evolving player wins a game, with other
players either calling or raising, then the learning value is incremented by 0.1, but will
never exceed 10. If the player folds after raising or calling, the learning value is
decreased by 0.1 unless it is already zero.
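
The update rule can be sketched as a single function. The outcome labels and the function name are hypothetical, and rounding to one decimal place is an assumption made to keep values on the 0.1 grid; the bounds of 0 and 10 are the ones stated above.

```python
def update_learning_value(lv, outcome, step=0.1):
    """Adjust one hand's learning value after a game.

    outcome is a hypothetical label:
      "won_contested"    - the player won while others called or raised
      "folded_after_bet" - the player folded after calling or raising
    Any other outcome leaves the value unchanged.
    """
    if outcome == "won_contested":
        lv = min(10.0, lv + step)   # never exceeds 10
    elif outcome == "folded_after_bet":
        lv = max(0.0, lv - step)    # never drops below zero
    return round(lv, 1)             # assumption: stay on the 0.1 grid


print(update_learning_value(10.0, "folded_after_bet"))
print(update_learning_value(10.0, "won_contested"))
```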

All our tests have five players seated at the table. Player 1, except for initial testing
to confirm the system is operating fairly, is always the evolving player. Players 2
to 5 are the non-evolving players as defined in tables 1 to 4. Each player
has 10,000 units allocated to them, making a total of 50,000 units at the table. The
evolving player has its learning values initialised to ten at the start of each
training session. The deal and betting move clockwise around the table. The
player to the dealer's left always plays first. Initial testing, using tables of similar
players with no evolving player, showed that the game was fair, in that no single
player or position dominated.

The evolving player must initially be trained by allowing manipulation of the 
learning values. It is interesting to monitor the evolving player during this learning 
period. 

Initially, the evolving player plays every hand. This can be seen in epochs (hands 
played) 1 to 100 in table 5. After this, learning values are being reduced and the 
number of hands played gradually decreases. 



Table 5: Number of Games Played and Won During the Training Period

Epoch    Hands Played    Hands Won    % Played    % Won
25       25              5            100         20.0
50       50              12           100         24.0
75       75              19           100         25.3
100      100             28           100         28.0
200      195             60           97.5        30.7
300      257             81           85.6        31.5
400      324             102          81.0        31.5
500      393             123          78.6        31.3
1000     647             216          64.7        33.3
1500     925             312          61.6        33.7
2000     1187            409          59.3        34.5
2500     1421            484          56.8        34.0
3000     1664            560          55.5        33.6



Table 5 also shows another interesting result. The percentage of hands played
gradually falls, yet the percentage of played hands that are won increases, demonstrating that the player
is learning. At the end of the training period less than half of the hands are being
played with over 33% of them winning.

Next we consider how the training process affects the number of units the evolving 
player wins or loses over a specific time period. As highlighted above, the player will 
soon realise that playing in every pot (and raising it, due to the high learning values) 
is not the best method of playing poker. As the player begins to adapt, the losing 
streak eventually levels off and changes into a winning streak, creating a better player, 
maximising its winnings against a variety of players. Figures 4 and 5 show how the 
program learns to play poker against two different tables, where a table consists of 
players of the same playing style.

Fig. 4: Learning Curve, against Loose Aggressive Players

In figures 4 and 5 an obvious losing trend can be seen, particularly in figure 5.
The evolving player initially loses, before the graph levels off and then rises. Figure 4
has an initial losing period until epoch 350, when the losing streak begins to level off
as the learning starts to have an impact. By epoch 1300 the learning process is almost
complete, and the program begins to win and eventually holds more units than it
started with. In figure 5 learning takes slightly longer: the initial losing streak
continues until epoch 650. The curve then stays level until epoch 2650, when the
learning process allows the player to regain its earlier losses, and by epoch 3600 the
evolving player is back in the black. Figures 4 and 5 show that learning against
a table of Loose Aggressive players is quicker than against a table of Tight Aggressive
players, because tight players play fewer hands themselves. In addition, the
evolving player wins more money against loose players than against tight players.

Fig. 5: Learning Curve, against Tight Aggressive Players (x-axis: Number of Epochs)



Table 6 shows the results when the evolved player (i.e. after training) is played
against players of a single style, after a training period of 5000 epochs. It appears
that a player can be trained in about 4000 epochs (figures 4 and 5); the figure of
5000 was chosen as insurance against slow
learning due to an unfavourable distribution of cards. The results in table 6 are for games played
over 1000 hands, averaged over five runs.



Table 6: Units Won by each player (Player 1 is the evolving player)

Player    Loose Aggressive    Tight Passive    Loose Passive    Tight Aggressive
1         13342               11123            15632            10816
2         9696                9536             8520             9527
3         8602                9459             8724             9730
4         8879                9420             8189             9837
5         9496                9706             8935             10079

[The played/won counts attached to each entry in the original table are illegible in this copy and are omitted.]



The evolved player beats all the other players, whilst the non-evolving players
perform evenly across all of the tables. It is also interesting to note that the evolving
player participates in more games, due to its more aggressive nature. However, this
does not mean that the player is guaranteed to win. In fact, the opposite is true: the
more games played, the more likely it is that the player will be open to defeat, playing with
lower-ranked cards. This suggests that when the evolving player holds a strong
hand it takes a very positive approach, raising frequently.

It is an interesting observation that the non-evolving loose players play more hands
than tight players due to an overestimation of their hand value. In general, the loose
players lose more than tight players and the evolving player does better against the
loose players. It is well known that tight poker players will do better than loose
players, but there is a balance to be struck; otherwise a tight player would only ever bet
with the best hand. It would appear that the evolving player has found such a balance.

So far, the players at a given table have all been of the same type. Figure 6 shows 
how the evolving player competes when there are four different types of player at the 
same table (we also tested a variety of different players at the same table and the 
results are similar). 

The results confirm our intuition that the loose players do worse than the tight 
players. It is also pleasing to see that the evolved player beats all the other players. 




Fig. 6: A Table Consisting of each Type of Player (x-axis: Epoch Number; series: Evolving, TA, TP, LA, LP)



4. Conclusions and Discussion 

This paper has carried out an initial investigation into how a computer program
can learn to play poker. We realize that this is only an initial investigation, but we
feel that the method we propose, although simple in its
implementation, shows that an adaptive poker player is a promising research
direction. Not only would several competing research groups be able to promote the
research domain, but a sustained research strategy could also derive benefits in other areas
such as bluffing, negotiation and dealing with imperfect information. These insights
would be valuable in domains such as on-line auctions, game playing theory,
negotiating and real world economics.

Our current research plans will consider using Texas Hold ‘Em as a more suitable
poker variant. We are also experimenting with evolutionary strategies in place of the
simple learning technique we currently employ. We also plan to experiment with
co-evolution so that different strategies have to fight to survive to a future generation.
Finally, we will also incorporate bluffing and negotiation, as we feel these elements
are needed in order to compete with the best human players.

Abstract. This paper presents our approach and a fully implemented
system for incrementally building complex adaptation functions for case- 
based reasoning (CBR) systems. 

Building a CBR system still remains a difficult task due to the difficulties 
of developing suitable retrieval and adaptation mechanisms for a given 
application. To address these difficulties, we extended the basic Ripple 
Down Rules framework to allow the incremental development of an adap- 
tation function during the use of the system for solving actual problems. 
In our approach the expert is only required to provide explanations of 
why, for a given problem, a certain adaptation step should be taken. 
Incrementally a complex adaptation function as a systematic composi- 
tion of many simple adaptation functions is developed. Our approach is 
effective with respect to both, the development of highly tailored and 
complex adaptation functions for CBR as well as the provision of an 
intuitive and feasible approach for the expert. 

The approach has been implemented in our CBR system MIKAS, for the 
design of menus according to dietary requirements. 

In this paper we present experimental evidence for the suitability of our 
approach to address the adaptation problem in the development of CBR 
systems. 



1 Introduction 

Case-Based Reasoning (CBR) is an AI approach to building intelligent systems
which is increasingly finding its way into industrial practice. The basic idea is to
solve new problems by remembering solutions to problems which are similar to
the current problem. Usually, one or multiple cases are remembered, which are
similar to the current problem case and allow the derivation of a solution for the
current problem case from the solutions of the remembered cases. If necessary,
a remembered case is modified in such a way that at least parts of the case are reused.
This process is known as case adaptation [4]. A major practical advantage
of CBR is the fact that experts are often eager to tell their “war stories” - the
cases they encountered. This is opposed to the situation where experts are asked 
to provide abstract general rules of what they do, which is a much harder task 
for them. 



M. Brooks, D. Corbett, and M. Stumptner (Eds.): AI 2001, LNAI 2256, pp. 201-212, 2001. 
(c) Springer-Verlag Berlin Heidelberg 2001







A.S. Khan and A. Hoffmann 



However, the effectiveness of a CBR system depends not only on having and
retrieving relevant cases but also on selecting which retrieved cases to apply
and determining how to adapt them to fit a new situation [5]. Both suitable
retrieval and suitable adaptation of a case will usually require domain-dependent
knowledge. Case-based reasoning systems generally do not refine the methods
they use to retrieve or adapt prior cases, instead relying on static pre-defined
rules or procedures [7]. It is generally impossible to anticipate all the difficulties
and problems one may encounter in a domain. As a consequence, the knowledge
represented in cases is often insufficient for an effective CBR system. In practice
the problem of defining proper adaptation rules represents a major bottleneck
for successfully developing CBR systems. The problems are so acute that many
CBR applications simply omit case adaptation [11]. Effective CBR performance
therefore requires acquiring case-specific and general domain knowledge as an
ongoing process.

Experiences from knowledge acquisition for expert systems have also shown 
that it is very difficult to obtain the relevant knowledge from an expert, as 
experts are usually unable to provide precise rules which would describe their 
decisions. 

Ripple-Down Rules have been developed as an extremely effective approach 
for the acquisition of classification knowledge, as they require the expert only to 
provide explanations of the decision taken in a given situation. 

In this paper, we introduce a radically new approach for developing a suitable
adaptation function. We encode adaptation knowledge in rules, which somewhat
resemble the rules used in the INRECA approach [2]. The experiences in the
INRECA project also showed the difficulty of actually encoding suitable
adaptation knowledge [2]. We address this difficulty with a new
way for an expert to interact with the system in order to provide
suitable adaptation rules. Our approach for the acquisition of adaptation
knowledge is based on ideas of Ripple-Down Rules [3].

The paper is organised as follows: In section 2, we present the motivation
and technical details of our approach. Section 3 illustrates how the approach is
implemented in our menu design system MIKAS. Section 4 presents the results
of our ongoing evaluation studies so far. The conclusions are found in section 5.



2 Incremental Knowledge Acquisition for Building CBR 
Adaptation Functions 

2.1 Motivation 

Building Case-Based Reasoning systems successfully often requires the devel- 
opment of a specialised retrieval function, which is complemented by a highly 
specialised adaptation function tailored to the domain. It must be ensured that 
the retrieved case can be successfully adapted if required. 

For example, in our system MIKAS for designing menus according to dietary
requirements, cases are menus along with patient descriptions. A suitable menu
must not only match the nutritional requirements but also accommodate a large
number of special requirements for various health conditions.
Such special requirements may range from avoiding spicy food, through minimising
the salt content, to ensuring that all foods come in sufficiently small pieces,
which may even exclude thick drinks, such as milk shakes, due to the patient’s
difficulties when swallowing.

We borrow ideas from Ripple-Down Rules [3], which ensure an incremental coverage
of the expert’s judgements and decision making. That is, for all problem
cases for which the expert provided advice to the system on how to adapt a
retrieved case, the system will have to reproduce the expert’s performance.

We assume a domain expert will have a sufficient understanding of how to 
adapt a retrieved case in order to find a solution to the current problem. 

2.2 Ripple-Down Rules 

The basic idea of Ripple-Down Rules (RDR) [3,8] is to develop a knowledge base 
by allowing the expert to directly interact with the system and to incrementally 
add rules to a knowledge base. RDR has been applied successfully in building 
what appears to be the largest expert system in routine use [6] . It has also been 
applied to some construction tasks, such as the Sisyphus I problem in [12]. 

In RDR, the object space to be classified is incrementally subdivided into
smaller and smaller partitions, until all objects in each single partition belong
to the same class. The rules for subdividing the object space are specified by
the expert whenever an object is encountered which the system classifies in
disagreement with the expert. That is, the current object $x$ falls into a partition
$p$ of the object space, which classifies the object incorrectly. Hence, this partition
needs to be further subdivided into two partitions $p_1, p_2$, such that the partition
to which $x$ belongs classifies $x$ correctly. Such a subdivision can be provided by
the expert competently and with minimal effort, as the expert is merely required
to provide an explanation of why $x$ is different from the previously presented
object $x_p$, which led to the creation of the partition $p$ in the first place. That is,
the expert needs only to provide a criterion by which $x$ differs from $x_p$ and which
justifies the different classification. To provide such an explanation is usually easy
for an expert as it is not much different to explaining their decision to colleagues.

See Figure 1 for an RDR tree. An object is classified by this tree as follows:
Initially the ‘default rule’ in node 1 is evaluated, i.e. class ‘-’ is obtained. Before
this becomes the final ‘verdict’, it is checked whether any ‘except’ link from the
current node exists. If there is such a link, the connected node’s condition is
checked.

Here, we check the condition of node 2: if “C” is true, then the corresponding
class ‘+’ overwrites the previous ‘verdict’, unless this in turn is overwritten by
another except link. If an except link exists, but the connected rule condition is
not satisfied, say in node 3, then the nodes along a possibly existing chain of ‘if
not’ links are checked. For instance, if the condition of node 2 is satisfied, node 3 is
checked. If the condition of node 3 is not satisfied, then node 5 is checked and,
if node 5 is not satisfied, node 6 is checked. If any of these nodes 3, 5, or 6 is
satisfied, the classification of the first satisfied node is the final verdict, unless
this node again has an overwriting except link.

Fig. 1. An example of a Ripple Down Rules tree.

If the expert is not satisfied, say with the classification due to node 5, the 
system asks for a justification in terms of the current object’s attributes. This 
explanation is then used as the condition to a new exception rule to node 5. 
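
The traversal described above can be sketched in a few lines. The `Node` class, its field names and the small example tree are hypothetical (the tree of Figure 1 is not reproduced here); the sketch only assumes the two link types named in the text, ‘except’ and ‘if not’.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    condition: Callable[[dict], bool]
    conclusion: str
    except_link: Optional["Node"] = None   # followed when the condition holds
    if_not_link: Optional["Node"] = None   # followed when the condition fails


def classify(root, obj):
    """Return the conclusion of the last satisfied node on the path."""
    verdict = None
    node = root
    while node is not None:
        if node.condition(obj):
            verdict = node.conclusion      # overwrites the previous verdict
            node = node.except_link
        else:
            node = node.if_not_link
    return verdict


# Hypothetical tree: a default rule with one exception, which itself has
# an exception (node3) followed by an 'if not' alternative (node5).
node5 = Node(lambda o: o.get("e", False), "-")
node3 = Node(lambda o: o.get("d", False), "-", if_not_link=node5)
node2 = Node(lambda o: o.get("c", False), "+", except_link=node3)
root = Node(lambda o: True, "-", except_link=node2)

print(classify(root, {"c": True}))
```

Adding a new exception, as in the last paragraph, then amounts to attaching a new node at the end of the ‘except’/‘if not’ chain of the node whose verdict the expert rejected.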

2.3 Formal Preliminaries 

In general we consider attributes to be either numerical or discrete valued.¹

We describe a case $C = (P, S)$ by two attribute vectors with the domains
$P_1 \times P_2 \times \ldots \times P_n$ and $S_1 \times S_2 \times \ldots \times S_m$ respectively.

- The attribute vector $P = (p_1, \ldots, p_n)$ represents the problem specification.

- The attribute vector $S = (s_1, \ldots, s_m)$ is an ordered list of components and
represents a solution to the problem $P$.

We call Comps the available component list. This is the general list of all
possible types of components which can be included in the design. Hence, the
components in the solution are also taken from the general list of available
components, i.e. $s_1, \ldots, s_m \in Comps$.

Each component $c \in Comps$ is described by a number of attributes, called the
component attributes, i.e. $c = (c_{a,1}, c_{a,2}, \ldots, c_{a,k})$. The domain of each component
attribute $c_{a,i}$ is denoted by $C_{a,i}$. That is, each component is an element of
$C_{a,1} \times \ldots \times C_{a,k}$.

In our case of diet construction, this is a database of possible foods, describing 
the nutrient content and food type of each food. 

The case base contains cases, which are composed of a problem specification
and a solution. The solution is a design, composed of a number of components
which are implicitly related to each other by their respective position in the
ordered list of components $S$.

¹ Numerical valued attributes may be integer or continuously valued - the critical
point is that the possible values have a meaningful order, while the discrete valued
attributes have a finite value range without any meaningful order.

Generally speaking, the problem statement is composed of two parts: 

— an attribute-value vector, stating certain properties of the problem, in order 
to choose the solution from an appropriate category. 

— a collection of constraints, which either require certain values for certain 
slots in a solution and/or which forbid certain values for certain slots. 
Constraints may also specify that certain combinations of values of multiple 
attributes should or should not occur in a solution. 

In the particular case of diet construction, our components are the various 
foods we have available and they are related to each other by composing the 
various meals of the day. 

In diet construction, there are general constraints such as the required energy 
and nutrient level of the diet as well as specific constraints such as to choose an 
appropriate diet for vegetarians or diabetes patients, etc. 

Some constraints are explicitly stated in the problem description, e.g., the 
amounts of the required daily intakes of nutrients for the task of diet construc- 
tion. Other constraints are not explicitly stated but must be provided by a 
domain expert. For diet construction, this includes constraints on which types 
of foods to use for a specific meal to ensure that it represents a sensible com- 
position of foods, such as meat combined with potatoes or rice combined with 
vegetables or salad. 



2.4 Incremental Acquisition of Adaptation Knowledge 

After cases have been retrieved, the system tries to adapt the few highest ranked 
cases in order to fit the current problem. 

If the system cannot produce a satisfactory solution, additional adaptation 
knowledge must be provided to the system. Generally speaking, this can either 
be a refined retrieval function, or a refinement of the adaptation function. The 
expert has to decide whether the retrieved cases are feasible to be adapted or 
whether other cases need to be retrieved. 

If the expert decides to adapt the solution of a retrieved case, he/she tries to 
manually adapt the solution so that a solution for the current problem is found. 
This adaptation is done by removing and adding components. Subsequently, the 
system requests explanations for each adaptation action the expert chose. To 
do so, the expert has to provide a condition under which the chosen adaptation 
action (the removal or addition of a component) should be executed. This condi- 
tion is expressed, generally speaking, in terms of the current problem description 
as well as the retrieved solution to be adapted. 

Furthermore, the action needs to be specified beyond being either a deletion
or an addition of a component: the particular component which is added to or
deleted from the particular slot of the solution needs to be specified. This is done
again by providing conditions which a component has to meet to be eligible for
deletion or addition from or to a particular slot or class of slots. The new rule
r' is integrated into the existing Ripple Down Rules structure as an exception
to the rule r which produced the undesired adaptation action that the expert
intends to supersede by the new rule. The system ensures that the conditions
entered by the expert do not apply to any of the cases to which the previous rule r
applied successfully, i.e. for which rule r led to an adaptation action which was
accepted by the expert.
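
One reading of this consistency check can be sketched as follows. The function name, the case attributes and the example conditions are all hypothetical; the sketch assumes that a new exception condition is acceptable only if it fires on none of the cases for which rule r produced an action the expert accepted.

```python
def is_valid_exception(new_condition, accepted_cases):
    """True if the proposed condition for r' fires on none of the cases
    that the parent rule r already handled to the expert's satisfaction."""
    return not any(new_condition(case) for case in accepted_cases)


# Hypothetical cases previously handled correctly by rule r.
accepted_by_r = [
    {"condition": "ulcer", "age": 60},
    {"condition": "ulcer", "age": 72},
]


def too_general(case):
    return case["condition"] == "ulcer"


def specific(case):
    return case["condition"] == "ulcer" and case["age"] < 30


print(is_valid_exception(too_general, accepted_by_r))
print(is_valid_exception(specific, accepted_by_r))
```

The first condition would be rejected because it also covers the accepted cases, while the second separates the new case from them.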



Abstract actions. The purpose of abstract actions is to allow a way of gen- 
eralisation. An adaptation rule is provided by an expert, who encounters an 
individual case which needs to be adapted, while observing the system’s perfor- 
mance. The expert will be able to decide which component should be replaced 
by which new component. 

However, merely providing the identity of those components will render the
CBR system incapable of adapting a case which contains not exactly the same
but perhaps a very similar component. Similarly, the new component may need
some variation in a new case which needs to be adapted. As a consequence, our
system lets the expert specify the action he/she suggests for adaptation of the
given case, and then asks the expert to abstract from the individual action and
to give a more abstract description of that action (usually a replacement of a
component, or an increase or decrease in the amount or size of a component).

An abstract action can be considered as a set of rules in itself. That is, 
depending on the features of the solution to be adapted, the abstract action 
may result in different changes to a different case. One important aspect is, 
that numerical feature values can be used to determine numerical aspects of the 
change. Most notably, the numerical attributes of components which are to be 
removed or added to the design can be determined by largely simple arithmetic 
calculations. 

Formally, abstract actions are defined by a set of rules. Each rule has
a condition part as described in the previous subsection and an action part. The
action to be specified is either to delete or add a certain component, or to increase
or decrease the amount/size of a component, which can be identified using certain
attributes of the component and whose numerical attributes can be described
using the attribute values of any part of the current case. A replacement-action
is then composed of a delete-action and an add-action.

A delete-action just needs to identify the component to be deleted. This is 
done by specifying attribute value ranges for the attributes of components, which 
have to be matched by the component which is deleted from the current case. 

The formal description is essentially the same as for the conditions of adaptation
rules as mentioned earlier, although the involved attributes do not describe
the case but rather the component. The first component $c = (c_{a,1}, \ldots, c_{a,n}) \in
Comps$ will be deleted, which is found in the component list of the case and for
which the following condition holds:

- $(c_{a,i_1,\min} \le c_{a,i_1} \le c_{a,i_1,\max}) \wedge \ldots \wedge (c_{a,i_k,\min} \le c_{a,i_k} \le c_{a,i_k,\max})$,

where $c_{a,i_1}, \ldots, c_{a,i_k} \in \{c_{a,1}, \ldots, c_{a,n}\}$.

Similarly, add-actions can be defined as follows: add the component $c =
(c_{a,1}, \ldots, c_{a,n})$ from the available component list, for which the following
condition holds:

- $(c_{a,j_1,\min} \le c_{a,j_1} \le c_{a,j_1,\max}) \wedge \ldots \wedge (c_{a,j_k,\min} \le c_{a,j_k} \le c_{a,j_k,\max})$,

where $c_{a,j_1}, \ldots, c_{a,j_k} \in \{c_{a,1}, \ldots, c_{a,n}\}$.
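
Delete- and add-actions defined by attribute value ranges can be illustrated with a small sketch. It is not the MIKAS implementation: the function names, the dictionary representation of components, the inclusive range bounds and the food attributes and values (`caffeine_mg`, `energy_kj`) are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Ranges = Dict[str, Tuple[float, float]]  # attribute -> (min, max), inclusive


def matches(component: dict, ranges: Ranges) -> bool:
    """True if every constrained attribute lies within its range."""
    return all(lo <= component[attr] <= hi for attr, (lo, hi) in ranges.items())


def delete_action(solution: List[dict], ranges: Ranges) -> List[dict]:
    """Remove the first component of the solution that matches the ranges."""
    for i, comp in enumerate(solution):
        if matches(comp, ranges):
            return solution[:i] + solution[i + 1:]
    return list(solution)


def add_action(solution: List[dict], available: List[dict], ranges: Ranges) -> List[dict]:
    """Append the first available component that matches the ranges."""
    for comp in available:
        if matches(comp, ranges):
            return solution + [comp]
    return list(solution)


# Hypothetical data: made-up foods and attribute values.
dinner = [
    {"name": "roast chicken", "caffeine_mg": 0.0, "energy_kj": 900.0},
    {"name": "coffee", "caffeine_mg": 80.0, "energy_kj": 10.0},
]
foods = [
    {"name": "cola", "caffeine_mg": 30.0, "energy_kj": 600.0},
    {"name": "baked custard", "caffeine_mg": 0.0, "energy_kj": 600.0},
]

# A replacement-action composed of a delete-action and an add-action.
adapted = delete_action(dinner, {"caffeine_mg": (1.0, 1000.0)})
adapted = add_action(adapted, foods, {"caffeine_mg": (0.0, 0.0),
                                      "energy_kj": (500.0, 700.0)})
print([c["name"] for c in adapted])
```

Because the action is stated in terms of attribute ranges rather than a food's identity, the same rule removes any caffeinated item, not only coffee.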

3 MIKAS: Integrated KA and CBR Workbench 

MIKAS implements the strategy sketched above for developing a CBR system.
We used our RDR-based approach for both phases, case retrieval [10] and
adaptation [9]. In this paper, we will focus on the adaptation process and present
first experimental results with the fully implemented system.

3.1 The Problem Domain of Diet Construction 

A menu is recommended by a dietitian for a patient with specific health con- 
ditions. These health conditions represent stringent requirements on the menus 
to be designed. This does not only include certain amounts of various nutri- 
ents which should be contained in the diet. It may also refer to certain ways 
of preparing foods, e.g., use of spices, certain types of foods (vegetarian, kosher 
meat, etc.), or certain other aspects of the involved foods, such as their texture 
or flavour. The significance of this is that for ’normal’ menus it seems appropri- 
ate to retrieve cases according to how well they match the required nutrients. 
However, adaptation may be necessary in order to scale the amount of foods 
to match nutrient requirements or to remove or replace certain types of foods 
which are unsuitable for the given patient. 

Our system MIKAS accepts a description of the patient along with nutritional 
requirements. Applicable restrictions are not necessarily explicitly given, but 
acquired, largely in the form of adaptation rules, provided by the expert. 

Once a problem description has been entered, the expert considers the retrieved
menus. If he/she is not satisfied with the retrieval result, the retrieval
function will be enhanced. Otherwise, he/she tries to adapt one of the retrieved
menus, using various adaptation steps including the addition or deletion of
foods or the increase/decrease of portions.

3.2 A Sketch of a Knowledge Acquisition Session with MIKAS 

The patient description of a 60-year-old duodenal ulcer patient, who usually has
non-vegetarian, English meals, was entered into MIKAS. The system retrieved
menu 1. The expert did not agree with the retrieved menu for various reasons,
including the coffee in the dinner, which is harmful for ulcer patients.

So the expert modified the retrieved menu by taking out the breakfast grapefruit
juice, which is considered harmful for the patient. The expert further
increased the portion size of the asparagus in the lunch in order to better match
the nutrient requirements. The expert also deleted the dinner coffee for its caffeine
content, and added baked custard to the snack slot for 11am, as an ulcer patient
needs frequent meals at regular intervals. At the end of the adaptation steps, the
expert found the resulting menu suitable for the patient.

Fig. 2. (a) A new menu before and (b) after automatic adaptation using expert rules
entered while adapting a previous case.

After that, MIKAS asked the expert for justifications for each of the actions
taken. That is, the expert was asked to provide conditions, which have to match
the current case, under which the corresponding action should be taken. Furthermore,
MIKAS asked the expert to describe the action taken in more abstract
terms. For example, if the expert deleted the grapefruit juice from the breakfast slot,
the expert was asked whether only grapefruit juice, any fruit juice, or even any
fruit meeting specifiable characteristics should be deleted, and whether
the action should be applied only to the breakfast or to other menu sections as
well.

In another session a description of a patient, who is much younger in age but 
has similar health conditions was entered into MIKAS, together with nutrient 
requirements. The nutrient requirements for this patient were higher, because he 
is younger and, hence, needs more energy supply and more protein as he is more 
active. For this patient, MIKAS retrieved menu 2 in Figure 2 (a) and adapted it 
to Figure 2 (b). The rules entered before resulted in the following actions when 
applied to menu 2: 



1. There was no grapefruit juice in the breakfast slot, but MIKAS deleted
orange&mango juice instead, because it belongs to the same group
of fruit juices, as specified in the abstract action by the expert.

2. MIKAS did not find asparagus in the lunch slot of menu 2, so baby carrots
were taken instead and the portion size was increased.

3. MIKAS deleted coffee from the dinner slot.

4. MIKAS added baked custard to the snack slot.

If the expert finds the deletion of orange&mango juice inappropriate in the 
case of menu 2, he can say so and suggest an alternative action. MIKAS will 
then ask again for conditions under which the alternative action should be taken. 
These conditions must differentiate between menu 1 and menu 2 and/or between 
the first and second patient, so that the first rule would still apply in the first 
case and the new rule would only apply to menu 2. This is accommodated in 
the structure of the Ripple-Down Rules and ensures that the task to provide 
suitable conditions can be easily accomplished by a domain expert. 
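The exception mechanism described above can be sketched as follows. This is a minimal illustration of a ripple-down rule with one exception, not the MIKAS implementation; the class `Rule`, the helper functions, the patient attributes and the age-based exception condition are all hypothetical.

```python
class Rule:
    """One node of a ripple-down rule tree: a condition, an action, and exceptions."""

    def __init__(self, condition, action):
        self.condition = condition  # predicate over (patient, menu)
        self.action = action        # function mapping a menu to an adapted menu
        self.exceptions = []        # refinements added later by the expert

    def select(self, patient, menu):
        """Return the action of the most specific rule that fires, or None."""
        if not self.condition(patient, menu):
            return None
        for exception in self.exceptions:
            chosen = exception.select(patient, menu)
            if chosen is not None:
                return chosen
        return self.action

FRUIT_JUICES = {"grapefruit juice", "orange&mango juice"}  # abstract food group

def delete_fruit_juice(menu):
    """Abstract action: remove any fruit juice from every menu slot."""
    return {slot: [f for f in foods if f not in FRUIT_JUICES] for slot, foods in menu.items()}

def keep_menu(menu):
    """Alternative action suggested by the expert: leave the menu unchanged."""
    return menu

# First rule, entered while adapting menu 1 for the first patient.
rule = Rule(lambda p, m: "ulcer" in p["conditions"], delete_fruit_juice)
# Hypothetical exception: its condition separates the second patient from the first,
# so the original rule still applies to the first case.
rule.exceptions.append(Rule(lambda p, m: p["age"] < 30, keep_menu))
```

Because the exception is only consulted when its parent rule fires, the expert has to supply just the conditions that distinguish the new case from the case for which the parent rule was entered.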

4 Experimental Studies 

We developed a CBR system using the presented incremental knowledge acquisition
approach. In our experiments we focussed on the development of the
adaptation process of our CBR system. The retrieval function was also developed
incrementally, but for most cases the retrieval function was not changed, even if 
the retrieved case appeared rather inappropriate. This was done to better eval- 
uate the potential of building adaptation knowledge bases with our approach. 

One of the authors, Abdus Salam Khan, a trained dietician, served as
the expert. We focussed on the following specific type of patient: the patients
had an English food habit, were aged 35-45, and had liver malfunction, so
the fat and protein content of their diet had to comply strictly with the
given nutrient requirements. Due to the varying body weight, physical activity and
age of patients in this group, different nutrient requirements were given, which
needed to be matched by a suitable menu.

Due to the varying nutrient requirements, the CBR system retrieved different
menus, which in turn needed to be adapted. It was usually not possible to simply
scale all portion sizes because the nutrient requirements of one patient were not
linearly related to other patients’ nutrient requirements.

Figure 3 (a) shows how our adaptation knowledge base grew with the number
of presented menus. The menus counted were either directly retrieved from the
case base or already partially adapted by the system and in need of
further adaptation. Two curves are shown: the number of rules added to
the knowledge base and the number of acceptable adaptation actions proposed
by the system. The discrepancy between the two curves
grows with increasing ‘experience’, which here means a
growing number of presented menus for which the expert judged whether the
system’s knowledge base proposed an appropriate adaptation action or not.

After the knowledge base grew to some 180 rules, it was already capable
of suggesting adaptation steps which were acceptable in the majority of the
presented cases. Given the large variety of patient requirements and retrieved
menus, this result is satisfactory. Currently, we are building a larger knowledge
base which will also be able to handle a larger diversity of patient types. We will
be able to report on the results with a larger knowledge base in the next few
months. While the current knowledge base cannot handle the complete adaptation
of every retrieved case, it can at least make some useful adaptation steps
automatically. A system that assists an expert in semi-automatically constructing
a suitable diet for a new patient is already a substantial help.

[Figure 3 panels: (a) RDR cases vs acceptable actions; (b) RDR cases vs CBR cases]

Fig. 3. (a) The growth of the knowledge base and its competence to handle adaptation
with the number of menus presented. An RDR case here is a menu that requires
adaptation and is presented to the system and the expert. A retrieved menu (a CBR
case) may require multiple adaptation steps, and thus would result in multiple
presentations of modified menus to the expert. Accordingly, the various menu versions are
counted as multiple RDR cases. (b) How the competence of the adaptation knowledge
base in handling cases increases with experience. A single CBR case usually involves a
sequence of multiple adaptation actions to be determined by the knowledge base.

Figure 3 (b) shows for how many different patients the adaptation
was handled by the adaptation knowledge base. For most retrieved cases it took
3-8 adaptation steps to turn a retrieved menu into a satisfactory menu for the
patient at hand. There was a relatively large number of retrieved cases where the
expert decided at some stage that no further adaptation steps should be taken
as the chances of successful adaptation of the retrieved menu appeared too slim.
Instead the expert worked on the retrieval function of our CBR system to allow
a more suitable case to be retrieved, which in turn can be adapted. We also
found that for many problem instances our case base initially did not contain
suitable cases at all.

We conducted our experiments with a constant case base to find out how well
our approach supports building flexible adaptation functions. However, in a
practical situation, we would rather add a substantial number of the satisfactorily 
adapted menus to the case base to allow retrieval of a more suitable case for a 
similar problem instance. The results reported here are intermediate results of 
ongoing evaluation studies. By the time the camera ready copy is due, we will 
be able to report on more detailed results. 
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5 Conclusion 

We presented our incremental approach for developing CBR systems where the 
expert directly interacts with the system and adds to its knowledge base during 
use of the system. Our implemented knowledge acquisition workbench MIKAS 
also allows the rapid development of expert support systems, which are not 
required to produce a complete design (or case adaptation more generally), but 
may produce an almost satisfactory design which is then manually completed by 
the domain expert. While completing the design, the system will interact with 
the expert in order to extend its knowledge base such that it is able to handle 
the current design automatically. Our experiments indicated the feasibility of 
our approach for complex design problems such as diet construction. 

Our experiments showed that the acquisition of effective adaptation knowledge
for a limited range of problem cases is feasible. The required knowledge can
be provided by an expert and can be organised into a Ripple Down Rules like
structure with relative ease. While our experimental results need to be comple- 
mented by a more complete study which is currently underway, the preliminary 
results are very encouraging. Our approach offers an alternative to the traditional 
labour and time-intensive approaches for building CBR systems. We believe that 
the presented approach can also be adapted for allowing the development of CBR 
systems for many other design tasks, helpdesk systems and other applications, 
where CBR systems have been successfully employed recently, see e.g. [1,13]. 
The strengths of our approach include the following: 

— it allows a domain expert to directly interact with the system without the 
need for a knowledge engineer or a system engineer. Adaptation rules are 
entered by the expert. The expert is guided by the system in providing 
suitable conditions for the rule’s application. 

— the expert is only required to explain, why certain solutions should be 
adapted in the demonstrated way for a given problem. This is much eas- 
ier than to provide general rules for adaptation. In our approach, the expert 
needs only to justify the particular adaptation steps he chose. 

— If an explanation results in overly general adaptation rules, this will be re- 
paired as soon as the expert encounters a situation, in which the system 
would perform unsuitable adaptations. 

— The integration of the incremental KA approach into a CBR system will 
hopefully result in substantially less effort on the side of the expert to build 
a satisfactory knowledge base compared to incremental KA that targets com- 
plete construction tasks from scratch such as [12]. 

Open problems which will be addressed in future research include the smooth
integration of the case retrieval and case adaptation processes. At this stage, we
asked the expert to adapt a retrieved case if possible at all. Modifying the retrieval
function in a CBR system may substantially alleviate the difficulty of developing a
suitable adaptation function, as less adaptation is needed if a more
suitable case can already be found in the case base.

In order to allow a balanced development of both, we envisage providing
support tools for the expert, such that the expert can quickly see what the
system would do with a possible candidate case that has not been retrieved by
the current retrieval function, but might be retrieved if the expert decides to
modify the retrieval function. For the quick retrieval of a suitable case from
the case base, a suitable query language is needed that allows the expert to try
various queries in order to identify a case that should be retrieved.
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Abstract. This paper presents an approach for a real-time region-based motion 
segmentation and tracking using adaptive thresholding and k-means clustering
in a scene, with focus on a video monitoring system. In order to reduce the
computational load of the motion segmentation, the presented approach
applies a weighted k-means clustering algorithm to the variation regions only,
followed by a motion-based region merging procedure. To indicate motion
mask regions in a scene, instead of determining the threshold value manually,
we use an adaptive thresholding method to automatically choose the threshold
value. To segment the image, the weighted k-means clustering algorithm is applied
only to the motion mask regions of the current frame. In this way we do not need to
process the whole image, so the computation time is reduced. The presented
method is able to deal with occlusion problems. Results show the validity of the
presented method.



1 Introduction 

Segmentation and tracking of moving objects from an image sequence is a basic
task for several applications of computer vision, e.g., video monitoring systems,
intelligent-highway systems, intrusion surveillance, airport safety, etc. [1-4]. Traditionally,
the most important task of monitoring safety is based on human visual observation.
However, an autonomous system that is able to detect anomalous or dangerous situations
can help a human operator, even if it cannot completely replace the human’s
presence. To facilitate a monitoring system, efficient image detection and segmentation
algorithms need to be used. Segmentation of an image usually divides the image
contents into semantic regions that can be dealt with as separate objects. A region
created by a segmentation algorithm is defined as a set of elements (pixels of images)
which are homogeneous in the feature space and connected in the decision space. A
region may not have any semantic meaning. One of the crucial elements of a monitoring
system is the motion analysis component, which segments moving objects from
an image sequence and estimates their motion on the image plane. An accurate
segmentation of the object is needed in order to estimate the motion accurately.
On the other hand, a moving object is characterized by coherent motion characteristics
over its entire region of support. Therefore, an accurate estimate of the motion is
required in order to obtain an accurate segmentation of the object. These tasks are difficult
for several reasons: there are multiple moving objects; the objects of interest are
usually small (in the image plane) and poorly textured; illumination conditions may be
poor and change rapidly; and multiple occlusions are likely and the environment may
be cluttered. However, probably the most challenging obstacle is the requirement of
real-time performance using relatively cheap hardware. Besides, many automatic
surveillance applications operate in real time, so good automatic segmentation algorithms
are required. These specific difficulties and constraints mean that a standard
off-the-shelf algorithm cannot usually be applied. Instead, special motion segmentation
algorithms must be designed. Most typical approaches to object segmentation are
based mainly on motion information: segmentation by dominant motion analysis [5],
MRF modeling [6-8], and Bayesian methods [9,8]. A major problem with all of the
above methods is that their performance is limited by the accuracy of the motion
estimation, which is itself an ill-posed problem. Pixel-based motion segmentation
methods, including [11], suffer from the drawback that the resulting segmentation map
may contain isolated labels. Motion segmentation algorithms, including the temporal
image difference-based method [12], motion-based method [13] and model-based
method [14], were developed. These methods, however, suffer from poor detection in
the real-world environment. What is worse, since the viewer moves in a dynamic
scene, it is difficult to extract only the regions corresponding to moving objects using
familiar methods of motion segmentation. These problems have made the visual perception
of the real-world environment a difficult and challenging topic of computer
vision. At present, the segmentation of moving objects from images is not satisfactory.
Although the k-means clustering algorithm is simple in principle, it requires a lot of
computation, and the threshold is usually selected by trial and error.

In this paper, a procedure for region-based motion segmentation and tracking of
image sequences is presented. As a basis for region-based motion segmentation, motion
detection with an adaptive threshold method [15,16], region segmentation with
the weighted k-means clustering algorithm and motion segmentation based on motion
information are used. Region-based approaches define groups of connected
pixels that are detected as belonging to a single object that is moving with a different
motion from its neighboring regions. Region tracking is less sensitive to occlusion due
to the extensive information that regions supply. The rest of this paper is organized as
follows. In Section 2, we present the method of region-based motion segmentation and
tracking from an image sequence. The section includes four parts: motion detection,
region segmentation, motion estimation and motion segmentation. Experimental results
of this method are shown in Section 3 and, finally, conclusions are given in Section 4.



2 Outline of the Process 

Fig. 1 shows the process of the presented method. First, in the motion detection
phase, the rough position of moving regions in an image is determined.

A Real-Time Region-Based Motion Segmentation 215

From the motion detection, a motion mask is created to indicate the position of
the moving regions in the current frame. To obtain the binary motion mask, instead of
manually determining the threshold value, as is the case in most vision-based systems,
we use an adaptive threshold to automatically choose the threshold value for motion
detection. Then, we apply the weighted k-means clustering to the motion mask regions
of the gray-level image of the current frame. In this way we do not need to process the
whole image, which saves computation time. Motion is estimated only in the segmented
regions, using heuristic measures (matching criteria). Finally, the regions that have
similar motion vectors are merged. When the presented method was tested on real image
sequences, the performance was robust not only to variations in luminance and changes
in environmental conditions, but also to occlusion among the moving objects.



[Figure 1: processing pipeline. Stages: Motion Detection, Region Segmentation, Motion Estimation, Motion Segmentation, Object Tracking. Data passed between successive stages: motion masks, list of regions, list of regions with motion estimation, list of regions.]



Fig. 1. The process sequence of the presented method 



2.1 Motion Detection 

The first phase of the presented algorithm indicates the moving objects’ position in the
current frame. In general, the motion of a moving object entails changes in intensity
magnitude, so intensity changes are important cues for locating moving objects in
time and space. These intensity changes can be represented not only by the differences
between two successive images, but also by the differences between the current image
and the background image. To detect the variation regions relative to the background, instead
of determining the threshold value manually, our method uses adaptive
thresholding to automatically choose the threshold value for variation region
detection. Fig. 2 shows the motion detection phase.




Fig. 2. Flow of the motion detection process 
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2.1.1 Variation Region Detection 



The motion detection phase receives as input a pair of gray-level images, $I_k$ and $I_{k+1}$,
and a predefined background image, $B$, in which no object exists. The output is the set of
variation regions of the image area where significant changes have been detected. The
two subtraction images between the $k$th frame, the $(k+1)$th frame and the background image are computed:

$$D_k(x,y) = |I_k(x,y) - I_{k+1}(x,y)|, \qquad D_B(x,y) = |I_k(x,y) - B(x,y)|.$$

The two difference images ($T_k$, $T_B$) are then computed by:

$$T_k(x,y) = \begin{cases} 1, & \text{if } D_k(x,y) > t_k \\ 0, & \text{otherwise} \end{cases} \qquad T_B(x,y) = \begin{cases} 1, & \text{if } D_B(x,y) > t_B \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

The threshold values ($t_k$, $t_B$) are selected by adaptive thresholding. The
motion mask ($M_k$) and the background update image ($B'$) are then computed by:

$$M_k(x,y) = \begin{cases} 1, & \text{if } T_k(x,y) \cap T_B(x,y) \neq 0 \\ 0, & \text{otherwise} \end{cases} \qquad B'(x,y) = M_k(x,y) - T_B(x,y). \tag{2}$$
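The construction of the motion mask from the two thresholded difference images can be sketched as follows. This is a minimal pure-Python illustration on nested lists, assuming that both differences are taken against the current frame; the function names and the toy frames are hypothetical, and the thresholds 61 and 48 are the values quoted in the caption of Fig. 4.

```python
def abs_diff(a, b):
    """Pixel-wise absolute difference of two equally sized gray-level images."""
    return [[abs(pa - pb) for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]

def binarize(diff, threshold):
    """Set pixels whose difference exceeds the threshold to 1, all others to 0."""
    return [[1 if d > threshold else 0 for d in row] for row in diff]

def motion_mask(frame_k, frame_k1, background, t_frame, t_background):
    """Intersect the frame-difference and background-difference binary images."""
    t_img = binarize(abs_diff(frame_k, frame_k1), t_frame)
    b_img = binarize(abs_diff(frame_k, background), t_background)
    return [[tf & tb for tf, tb in zip(rf, rb)] for rf, rb in zip(t_img, b_img)]

# Toy scene: a bright object moves one pixel to the right between the two frames.
background = [[10, 10, 10, 10], [10, 10, 10, 10]]
frame_k = [[10, 200, 10, 10], [10, 200, 10, 10]]
frame_k1 = [[10, 10, 200, 10], [10, 10, 200, 10]]

mask = motion_mask(frame_k, frame_k1, background, t_frame=61, t_background=48)
```

In the toy scene the frame difference fires at both the old and the new object position, whereas the background difference fires only where the object is in the current frame, so the intersection keeps only the current position.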



2.1.2 Adaptive Thresholding 



This method is a histogram-based approach to thresholding for motion detection and 
can be used to discard temporal variations due to illumination. The method for thresh- 
old selection was derived under the assumption that the histogram generated from the 
difference between two gray-level images contains three values combined with addi- 
tive Gaussian noise. The mixture probability density function of the difference value is 



$$P(d) = \sum_{i=1}^{3} \frac{\omega_i}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{(d-\mu_i)^2}{2\sigma_i^2}\right) \tag{3}$$

where $\omega_i$ is the population proportion, $\mu_i$ is the mean value of each of the three difference levels
and $\sigma_i$ is the standard deviation about the mean. As a result, the threshold problem is
formulated as the determination of the best thresholds, $\theta_1$ and $\theta_2$, separating the three
Gaussian models from one another. A threshold $\theta \in [-255, 255]$ divides the image
into three distributions: $p_1([-255, -\theta))$, $p_2([-\theta, \theta])$ and $p_3((\theta, 255])$. Note that the
optimal threshold occurs where two modes meet, i.e., in the valley between the
maxima of the two modes. Due to the symmetrical nature of the histogram model,
$\theta_1 = -\theta$ and $\theta_2 = \theta$. In Eq. (3), the respective population proportions, means and variances
are given by (shown for the third distribution, $d \in (\theta, 255]$)

$$\omega_3(\theta) = \sum_{d=\theta+1}^{255} m(d), \quad \mu_3(\theta) = \frac{1}{\omega_3(\theta)} \sum_{d=\theta+1}^{255} d\, m(d), \quad \sigma_3^2(\theta) = \frac{1}{\omega_3(\theta)} \sum_{d=\theta+1}^{255} \left[d - \mu_3(\theta)\right]^2 m(d). \tag{6}$$

Let $m(d)$ be the probability of a difference value $d$ and $n$ the size of the spatial domain of the
difference image; $m(d)$ is defined as

$$m(d) = \frac{1}{n}\,\mathrm{num}(d). \tag{7}$$

In Eq. (7), $\mathrm{num}(\cdot)$ is the function counting the pixels with the given difference value. The threshold value ($\theta$) is
determined by the fitting criterion defined as follows:

$$e = \frac{1}{W} \sum_{d=-255}^{255} \left[P(d) - m(d)\right]^2. \tag{8}$$

A best fit between the data and the Gaussian model is found by minimizing the mean
squared error between the mixture density ($P(d)$) and the probability of a difference
value ($m(d)$). Fig. 3 shows the histogram probability distribution and the Gaussian
model. The value of $\theta$ in Fig. 3 that minimizes the fitting criterion is taken as the threshold value
separating the Gaussian models from one another.
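The threshold selection can be sketched as an exhaustive search over candidate thresholds. The following is a minimal pure-Python illustration, not the authors' implementation. It assumes signed difference values in [-255, 255], computes the parameters of all three components in the same way as Eq. (6) over their respective ranges, normalizes the squared error by the number of difference levels, and floors the standard deviation to avoid degenerate Gaussians; all function names are hypothetical.

```python
from math import exp, pi, sqrt

LEVELS = range(-255, 256)  # all possible signed difference values

def histogram(diffs):
    """Probability m(d) of each difference value d, as in Eq. (7)."""
    counts = {d: 0 for d in LEVELS}
    for d in diffs:
        counts[d] += 1
    return {d: c / len(diffs) for d, c in counts.items()}

def component(m, lo, hi):
    """Proportion, mean and standard deviation of m restricted to [lo, hi]."""
    w = sum(m[d] for d in range(lo, hi + 1))
    if w == 0:
        return None  # empty range contributes no Gaussian
    mu = sum(d * m[d] for d in range(lo, hi + 1)) / w
    var = sum((d - mu) ** 2 * m[d] for d in range(lo, hi + 1)) / w
    return w, mu, max(sqrt(var), 0.5)  # assumption: floor sigma at 0.5

def mixture(parts, d):
    """Mixture density P(d) built from the (proportion, mean, sigma) triples."""
    return sum(w / (sqrt(2 * pi) * s) * exp(-((d - mu) ** 2) / (2 * s * s))
               for w, mu, s in parts)

def fit_error(m, theta):
    """Mean squared error between the mixture density and the histogram."""
    ranges = [(-255, -theta - 1), (-theta, theta), (theta + 1, 255)]
    parts = [c for c in (component(m, lo, hi) for lo, hi in ranges) if c is not None]
    return sum((mixture(parts, d) - m[d]) ** 2 for d in LEVELS) / len(LEVELS)

def adaptive_threshold(diffs):
    """Return the threshold in 1..254 that minimizes the fitting criterion."""
    m = histogram(diffs)
    return min(range(1, 255), key=lambda theta: fit_error(m, theta))
```

For a histogram with a noise peak around zero and two side peaks caused by moving pixels, every threshold in the empty valley between the peaks yields the same partition and hence the same error, so the search returns the first such value.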







Fig. 3. Probability of difference values and Gaussian model for the image sequence of Fig. 4 

At the end of this phase, a binary motion mask ($M_k$) is obtained in which changing pixels
are set to one and background pixels are set to zero. This operator is inherently robust
to noise, because noise does not repeat in two subsequent frames,
and it filters isolated spots arising from small movements of the sensor. Moreover, in the
experiments on image sequences, the points of $M_k(x,y)$ were occupied by moving objects
and therefore located the moving objects precisely. Next, a chain-code-based approach
separates loosely connected moving points with simple morphological operations.
Then, we apply a morphological operation to the difference image to remove the
noise and obtain the mask for the region segmentation. This technique locates the rough
positions of the moving objects. The only area where both frame differences are
meaningful is the location of the object in the $k$th frame. Fig. 4 shows the results of the motion
detection phase using the threshold value ($\theta$) in Fig. 3.




Fig. 4. Results of motion detection in the road image sequence: (a) input image sequence, $k$th
frame, $(k+1)$th frame, and background image, respectively; (b) time difference image when the
threshold value is 61; (c) background difference image when the threshold value is 48; (d) motion
mask



2.2 Region Segmentation 

The purpose of this phase is to group pixels of similar intensity that correspond to a
single object. The region segmentation phase segments the motion mask regions
into regions that are homogeneous in terms of intensity. The different homogeneous regions are
distinguished by their encompassing boundaries, which are obtained in this phase.
Our segmentation method is based on the classical k-means clustering algorithm. Pixel
intensity allows the algorithm to separate the pixels of different regions, while the
image coordinates account for the spatial concentration of the pixels in the motion mask
region.



2.2.1 K-Means Clustering

The region segmentation algorithm can be thought of as a variation of the k-means clustering
algorithm [16,17] that uses a three-component feature vector for each pixel: the
two image coordinates and the intensity. It looks for the regions that
minimize a weighted squared Euclidean distance measure. The iteration stops when
the maximum shift of the cluster means drops below a specified value. The initial cluster
means are chosen arbitrarily. Pixel $p^i$ is assigned to cluster $c^j$, $j = 1, 2, \ldots,$ number of
clusters. In practice, the number of clusters is set to the number of detected motion
mask regions plus 2. At each iteration, a
pixel ($i$) of the original cluster is assigned to the new cluster ($j$) which minimizes the
following criterion:

$$e^{ij} = (p^i - \mu^j)^T\, W^j\, (p^i - \mu^j) \tag{9}$$

where $p^i$ is a vector composed of the coordinates and the intensity of pixel $i$, $\mu^j = (m_x^j, m_y^j, m_i^j)$ is
the vector composed of the mean coordinates and mean intensity of cluster $j$,
and $W^j$ is a 3x3 diagonal matrix that contains a weight for each feature
of the cluster. $W^j$ is determined by minimizing the distance between the
pixels in cluster $c^j$. These weights are given by [18]:



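[Eqs. (10) and (11), which define the weights on the diagonal of $W^j$, are illegible in the source.]

$$(\sigma_x^j)^2 = \frac{1}{N_j} \sum_{p \in c^j} (x - m_x^j)^2 \tag{12}$$

where $N_j$ is the number of pixels in cluster $j$, $(\sigma_x^j)^2$ and $(\sigma_y^j)^2$ are the variances of the coordinates $(x, y)$,
and $(\sigma_i^j)^2$ is the variance of the intensity ($i$) in cluster $j$. Every cluster has its own matrix $W$.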

This matrix is recalculated every five iterations. Fig. 5 shows the segmentation results for
a simplified image. In the simplified image, the detailed texture of the moving
object body and the gray-level information of the original image are smoothed out, but the
object shape is preserved. Moreover, the number of segmented regions
in the detected moving region decreases.
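The clustering loop can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the per-cluster weights are taken as inverse feature variances (the weight formulas of [18] are not reproduced here), and the function names, tolerance and toy pixels are hypothetical. The five-iteration reweighting interval and the stopping rule follow the description above.

```python
def inverse_variance(pixels, labels, j, mean, eps=1.0):
    """Assumed weighting: one weight per feature, the inverse of its variance in cluster j."""
    members = [p for p, label in zip(pixels, labels) if label == j]
    if not members:
        return (1.0, 1.0, 1.0)
    return tuple(
        1.0 / (sum((p[f] - mean[f]) ** 2 for p in members) / len(members) + eps)
        for f in range(3)
    )

def weighted_kmeans(pixels, means, tol=1e-3, max_iter=100, reweight_every=5):
    """Cluster (x, y, intensity) triples; means holds the initial cluster means."""
    k = len(means)
    weights = [(1.0, 1.0, 1.0)] * k
    labels = [0] * len(pixels)
    for it in range(max_iter):
        # Assignment: weighted squared Euclidean distance to every cluster mean.
        for n, p in enumerate(pixels):
            labels[n] = min(range(k), key=lambda j: sum(
                w * (a - b) ** 2 for w, a, b in zip(weights[j], p, means[j])))
        # Update: recompute each mean from its members; keep empty clusters in place.
        new_means = []
        for j in range(k):
            members = [p for p, label in zip(pixels, labels) if label == j]
            if members:
                new_means.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_means.append(means[j])
        shift = max(max(abs(a - b) for a, b in zip(old, new))
                    for old, new in zip(means, new_means))
        means = new_means
        if (it + 1) % reweight_every == 0:
            weights = [inverse_variance(pixels, labels, j, means[j]) for j in range(k)]
        if shift < tol:  # stop when the largest mean shift is small enough
            break
    return labels, means

# Toy motion-mask pixels: a dark blob near the origin and a bright blob further away.
pixels = [(0, 0, 10), (1, 0, 12), (0, 1, 11), (10, 10, 200), (11, 10, 198), (10, 11, 201)]
labels, means = weighted_kmeans(pixels, [(0, 0, 0), (10, 10, 255)])
```

Only the pixels inside the motion mask would be passed to such a routine, which is where the saving in computation time comes from.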



2.3 Motion Estimation 

This phase finds the motion information of the segmented regions that minimizes the
sum of displaced region differences. The summation is done over all pixels of a
region produced by the region segmentation phase. The regions are associated with a
three-dimensional (3-D) feature vector describing the coordinates and the gray level.
Our motion estimation method is based on matching the feature vectors of regions
while also considering the displaced region difference. One fundamental approach
is to match one portion of an image at time $t$, $I_t$, with each portion of the successive
image at $t+1$, $I_{t+1}$, using image features such as coordinates and intensity. In
estimating motion information, the algorithm uses a matching window to compute
the similarity between the two portions in $I_t$ and $I_{t+1}$. The motion mask region is used
to limit the search for the possible location of a particular pixel in the frame $I_{t+1}$.
Within the motion mask region, the new location of $(x, y)$ is found by the displaced
region difference. We minimize the error measure, $ME$, defined as follows:

$$ME = \sum_{(x,y) \in c^j} \left| I_t(x,y) - I_{t+1}(x + u_x,\, y + v_y) \right| \tag{13}$$

where $c^j$ is the cluster $j$, $u_x$ and $v_y$ are the components of the motion vector, and $I_t(x,y)$ and $I_{t+1}(x,y)$ are pixel
intensity values at location $(x,y)$ in the segmented region. Fig. 6 (a) and (b) show the
motion information of the regions produced by the region segmentation phase. The
motion information makes it possible to separate occluding objects that move in different directions.

Fig. 5. Segmenting results of the original and simplified images when the number of clusters is
five: (a) segmented image of the original image in Fig. 4; (b) segmented image of the simplified image
on motion mask regions
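The search for the motion vector of a region can be sketched as an exhaustive match over a small window. This is a minimal pure-Python illustration, assuming that the displaced region difference is accumulated as a sum of absolute intensity differences and that a fixed square search window stands in for the motion mask limit; the function name and the toy frames are hypothetical.

```python
def region_motion(frame_t, frame_t1, region, search=3):
    """Return the shift (u, v) minimizing the summed absolute displaced difference.

    region is a list of (x, y) pixel positions in frame_t; frames are indexed [y][x].
    """
    height, width = len(frame_t), len(frame_t[0])
    best, best_err = (0, 0), float("inf")
    for v in range(-search, search + 1):
        for u in range(-search, search + 1):
            err = 0
            for x, y in region:
                xs, ys = x + u, y + v
                if not (0 <= xs < width and 0 <= ys < height):
                    err = float("inf")  # shifts that leave the image are rejected
                    break
                err += abs(frame_t[y][x] - frame_t1[ys][xs])
            if err < best_err:
                best, best_err = (u, v), err
    return best

# Toy frames: a two-pixel region moves by (2, 1) between the frames.
frame_t = [[0] * 5 for _ in range(5)]
frame_t1 = [[0] * 5 for _ in range(5)]
frame_t[1][1], frame_t[1][2] = 100, 150
frame_t1[2][3], frame_t1[2][4] = 100, 150

shift = region_motion(frame_t, frame_t1, [(1, 1), (2, 1)])
```

Because the error is accumulated per region rather than per pixel, each segmented region receives a single motion vector, which is what the subsequent merging step compares.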



2.4 Motion Segmentation 

As mentioned in the region segmentation phase, the k-means clustering algorithm is
applied to partition the images into small regions that are homogeneous in terms of
intensity. Partitioning the image into many homogeneous regions results in
oversegmentation, so a region-merging step is required to solve the oversegmentation
problem. The oversegmented partition can be relaxed using a motion information
similarity measure. In this phase, each small region from the motion estimation
phase is merged into the neighboring region to which it is
most similar in terms of motion information. Fig. 6 (c) and (d) show the result of
motion segmentation.
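The merging step can be sketched with a union-find structure over adjacent regions. This is a minimal illustration, assuming that two neighboring regions are merged when both components of their motion vectors differ by at most a tolerance; the similarity test, the tolerance and the function name are hypothetical, since the text does not define the similarity measure.

```python
def merge_regions(motions, adjacency, tol=1.0):
    """Group adjacent regions whose motion vectors agree within tol.

    motions maps a region id to its (u, v) vector; adjacency lists neighbor pairs.
    Returns the merged groups as sorted lists of region ids.
    """
    parent = {r: r for r in motions}

    def find(r):
        # Follow parent links to the group representative, compressing the path.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in adjacency:
        (ua, va), (ub, vb) = motions[a], motions[b]
        if abs(ua - ub) <= tol and abs(va - vb) <= tol:
            parent[find(a)] = find(b)

    groups = {}
    for r in motions:
        groups.setdefault(find(r), []).append(r)
    return sorted(sorted(g) for g in groups.values())

# Toy example: regions 1-2 move right, regions 3-4 move left, region 5 is static.
motions = {1: (2, 0), 2: (2, 0), 3: (-3, 0), 4: (-3, 1), 5: (0, 0)}
adjacency = [(1, 2), (2, 3), (3, 4), (4, 5)]
merged = merge_regions(motions, adjacency)
```

In the toy example regions 2 and 3 are adjacent but move in opposite directions, so they stay separate, which mirrors how motion information keeps occluding objects apart.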



3 Experimental Results 



In order to verify the effectiveness of our method, an experiment was performed
on complex road scenes acquired from a fixed viewpoint. The image sequences are
characterized by multiple moving objects, variable illumination conditions, noise,
artificial lighting and the presence of shadows. We achieve a frame rate of approximately
eight frames per second. The acquired images were digitized into 320×240 pixels.
The experiments were performed on a Pentium 333-MHz PC with Windows 98,
and the algorithm was implemented using the MS Visual C++ development tool.
Processing each image frame takes 0.19 sec on average. The time depends on the
number and size of moving objects present in the image. The average
processing time per algorithm phase is summarized in Table 1.

Fig. 6. Motion estimation and segmentation results: (a) segmented regions with motion vectors;
(b) motion segmented image



Table 1. Average processing time

Step                   Time        Average time
Motion Detection       15-92 ms    48 ms
Region Segmentation    32-103 ms   71 ms
Motion Estimation      24-81 ms    36 ms
Motion Segmentation    26-47 ms    32 ms



To show the robustness of the presented method, we performed a noise sensitivity
test. The test scenes are real road images containing multiple vehicles and a human,
corrupted by adding Gaussian noise at different SNR values. Table 2 gives the
percentage of correct moving object region detection as the Gaussian
noise increases, for different SNR values. As the noise level increases, the rate of correct
location decreases. However, the presented method shows an average of 92.5% correct
location on the road images with noise added at an SNR of -3 dB. This shows the robustness of
the method with regard to noise. The experimental results show that
the rough position of the moving object is determined. With an adaptive threshold,
the method provides better results in an environment with changing illumination.
The method was effective in reducing the computation time and segmentation
error when segmenting moving objects in a scene. To assess its effectiveness, it
was compared with Badenas et al.’s method [19] in terms of average time for moving object
segmentation and segmentation rate. Fig. 7 shows the moving object segmentation





222 J.B. Kim et al. 



results of the two methods. Fig. 7c shows the initial regions segmented by the k-means clustering
algorithm of Badenas et al.’s method. In the experiment, the value of k is 50.
Fig. 7d shows the regions segmented by our method. In Fig. 7e, Badenas et al.’s
method fails to segment the moving objects because they are small and involve little
motion. However, the presented method is able to segment all the moving objects in
the scene correctly. Table 3 shows the average segmentation time and segmentation
rate per frame for the two methods. As a result of the evaluation, the average moving
object segmentation rate is 94.7%. Moreover, the result shows that the presented
method improved both the computational efficiency and the segmentation accuracy. Fig. 8
presents the segmentation for a sequence in which three vehicles and a pedestrian are
moving in both directions. With our method, region segmentation is applied only to
the motion detection regions. This reduces not only the computation time but
also the segmentation error of the moving object. The moving object segmentation
rate was evaluated using the following procedure.

\[ \text{Moving object segmentation rate (\%)} = \frac{A}{B} \times 100 \qquad (14) \]

A: The number of pixels in the moving region segmented by this method
B: The number of pixels in the moving region extracted manually (Fig. 7(a),
white boundary regions)
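
The ratio in Eq. (14) can be sketched in a few lines. This is only an illustration, assuming that both regions are available as binary masks of equal size; the function name and the mask representation are hypothetical and do not come from the paper.

```python
import numpy as np

def segmentation_rate(segmented_mask, reference_mask):
    """Return A / B * 100 as in Eq. (14).

    A: pixel count of the automatically segmented moving region.
    B: pixel count of the manually extracted moving region.
    Both masks are assumed (hypothetically) to be binary arrays.
    """
    a = int(np.count_nonzero(segmented_mask))
    b = int(np.count_nonzero(reference_mask))
    if b == 0:
        raise ValueError("reference mask contains no moving-object pixels")
    return 100.0 * a / b
```

Note that a plain pixel-count ratio does not check that the two regions overlap, so it is best read together with the visual comparison in Fig. 7.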



Table 2. The results of sensitivity analysis

SNR [dB]             3      0      -3     -5     -10
Location rate (%)    96.7   96.2   92.5   73     54.7


Table 3. Evaluation of segmentation results

Method               Badenas et al.'s method    The presented method
Average time         0.78 ms                    0.16 ms
Segmentation rate    91 %                       94.8 %





4 Conclusions 

In this work, we have presented an approach for real-time region-based motion
segmentation and tracking of moving objects using adaptive thresholding and k-means
clustering on image sequences, with a focus on video monitoring systems. The
method is based on the gray level and motion information of the image sequence. The
gray level information is used not only to detect variation regions, but also to segment
moving regions. The motion information is used to merge adjacent regions that have
a coherent motion vector. The presented method has been tested in an outdoor
environment at the intersection of two roads. Experimental results demonstrate that
it is robust to noise and can deal with occlusions.

Fig. 7. Result of moving object segmentation: (a), (b) road image sequence; (c), (d) regions
segmented by Badenas et al.'s method and the presented method; (e), (f) moving objects
segmented by Badenas et al.'s method and the presented method

Abstract. A new scheme for automatic analysis and classification of
cells in peripheral blood images is presented in this paper. The proposed 
method can analyze and classify mature red-blood and white-blood cells 
efficiently. After we identify red-blood and white-blood cells in a blood 
image captured by a CCD camera attached to a microscope, we extract 
their features and classify them by a neural network model based on back- 
propagation learning. While we have fifteen different clusters including 
the normal one for red-blood cells, there are five different categories for 
white-blood cells. We also propose a new segmentation algorithm to ex- 
tract the nucleus and cytoplasm for white-blood cell classification. In 
addition, we apply the principal component analysis to reduce the di- 
mension of feature vectors efficiently without affecting classification per- 
formance. Experimental results demonstrate that the proposed method 
outperforms the learning vector quantization-3 and the k-nearest neigh- 
bor algorithms for blood cell classification. 



1 Introduction 

Various algorithms for automated analysis and recognition of medical images
have been proposed in conjunction with advanced artificial intelligence, image
processing, and computer graphics techniques [1], [4], [9], [13]. As a consequence,
several automatic medical diagnosis systems have been developed to help doctors
diagnose diseases. In particular, human red-blood and white-blood cells
provide valuable information to pathologists. Such information is used
to diagnose patients' diseases and allows pathologists to identify morphological
variations of blood cells. However, the inspection is time-consuming and requires
technical knowledge. With computer-aided inspection systems, pathologists can
obtain objective analysis results of blood cell images, and the false inspection ratio
can be reduced significantly. If some part of the inspection job is automated,
a technician can save substantial time and effort. Recently, there
was an attempt to differentiate the normal white-blood cell from three kinds of
leukemia cells. However, this is a very difficult task even for a specialist [1].

Generally, we can observe red-blood cells, white-blood cells, platelets, and
plasma in the image of a human peripheral blood sample under the microscope.
Based on the shape and color of the nucleus and cytoplasm, we can classify white- 
blood cells into five different types: neutrophil, eosinophil, basophil, lymphocyte, 
and monocyte, as shown in Figure 1. 

Neutrophil Eosinophil Basophil Lymphocyte Monocyte 

Fig. 1. Mature white-blood cells. 

In this paper, we focus on building a practical system that can be used in
the hospital. We classify fifteen different types of red-blood cells including the
normal one, as shown in Figure 2, with reference to the hematology literature and
with the aid of pathologists [5].

Fig. 2. Morphological shapes of red-blood cells



In the proposed system, we classify and count red-blood and white-blood
cells automatically. In our experiment, we design a cell classifier using a neural
network model and compare its performance with two other classifiers: the learning
vector quantization-3 (LVQ-3) and k-nearest neighbor (K-NN) algorithms. We
also reduce the number of multi-variate features using principal component
analysis (PCA) to construct a more efficient classifier.

This paper is organized as follows. Section 2 describes preprocessing, feature
extraction and classification algorithms for blood cell analysis. After we
present experimental results in Section 3, we draw conclusions in Section 4.



2 Classification of Blood Cells 

2.1 Preprocessing 

Input images are captured by a color CCD camera attached to a microscope,
magnified four hundred times, at a resolution of 640x480 pixels. Figure 3(a)
shows an input image. Since the clinical pathologist generally examines the ideal
zone, which has only a few folded cells, we select an image that is noise free and
well focused.




(a) Input image (b) Labeled image 

Fig. 3. Preprocessing of the input image. 



For the input image, we apply a luminance thresholding method using a fuzzy
measure [3] to separate red-blood and white-blood cells from the background of
the image. In the labeling step, we exclude cells on the boundary of the target
image. Each labeled cell is classified as a red-blood cell, a white-blood cell,
a platelet, or plasma based on its size and color. While the white-blood cell is
the largest and has a nucleus in it, the plasma and the platelet are considerably
smaller than red-blood and white-blood cells, as demonstrated in
Figure 3(a). Figure 3(b) displays labeled cells enclosed by their minimum
bounding rectangles.

2.2 Segmentation

In order to separate the white-blood cell into the nucleus and cytoplasm, we
propose a hybrid segmentation scheme based on regions and edges. After we
enhance image edges and remove noise by the nonlinear anisotropic diffusion
algorithm [10], we apply a watershed transform to the image [6], [8]. We then
merge the nearest regions by the k-means algorithm based on color information.

Once we apply PCA and the nonlinear anisotropic diffusion algorithm to
the input image, we can obtain important edge information by fusing the first
component and the other two components with appropriate weighting factors.
Compared to other noise filtering methods, the nonlinear diffusion algorithm has
the desirable property of removing noise while preserving the edge information.
Since the proposed segmentation method employs the watershed transform, PCA
and the nonlinear diffusion algorithm are very effective. PCA, also known as
the Karhunen-Loeve decomposition, can be used to find eigenvectors of the
covariance matrix. In this paper, we employ the linear PCA and use the covariance
matrix of the RGB color components to obtain eigenvectors and eigenvalues.

After the PCA operation, a fused image is generated. Regardless of the
color model adopted, we can assume that $e_1$, $e_2$, and $e_3$ are the eigenvalues
of the three principal components of the color. Weighting factors for those color
components are calculated by

\[ \alpha_1 = \frac{e_1}{e_1 + e_2 + e_3}, \quad \alpha_2 = \frac{e_2}{e_1 + e_2 + e_3}, \quad \alpha_3 = \frac{e_3}{e_1 + e_2 + e_3} \qquad (1) \]
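
A minimal sketch of the weighting in Eq. (1) follows. It assumes that the fused image is the weighted sum of the three principal-component images, which is one plausible reading of the text rather than a confirmed detail; the function names `pca_weights` and `fuse_components` are hypothetical.

```python
import numpy as np

def pca_weights(rgb_image):
    """Eigen-decompose the RGB covariance and return (weights, eigenvectors).

    weights[i] = e_i / (e_1 + e_2 + e_3), with eigenvalues sorted in
    descending order so that weights[0] belongs to the first component.
    """
    pixels = rgb_image.reshape(-1, 3).astype(np.float64)
    cov = np.cov(pixels, rowvar=False)        # 3x3 covariance of R, G, B
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvals / eigvals.sum(), eigvecs

def fuse_components(rgb_image):
    """Assumed fusion rule: weighted sum of the principal-component images."""
    weights, eigvecs = pca_weights(rgb_image)
    pixels = rgb_image.reshape(-1, 3).astype(np.float64)
    components = (pixels - pixels.mean(axis=0)) @ eigvecs
    return (components @ weights).reshape(rgb_image.shape[:2])
```

Because the weights are normalized eigenvalues, they sum to one and the first component, which carries most of the variance, dominates the fused image.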

In this paper, we employ a nonlinear anisotropic diffusion algorithm [10] to
avoid image blurring and solve the local problem of the linear diffusion filtering
operation. The nonlinear diffusion operation can be expressed in terms of the
time variable t as

\[ \frac{\partial I(x,y,t)}{\partial t} = c(x,y,t)\,\Delta I(x,y,t) + \nabla c(x,y,t) \cdot \nabla I(x,y,t) \qquad (2) \]

where $\Delta$ and $\nabla$ are the Laplacian and the gradient operators, respectively,
$I(x,y,t)$ represents the image at time t, and $c(x,y,t)$ is the diffusion
conductance coefficient and is globally changed by the local edge analysis. Perona and
Malik proposed a function of the intensity gradient [10] as

\[ c(x,y,t) = \frac{1}{1 + \left( \| \nabla I \| / K \right)^{2}} \qquad (3) \]

where K is a conductance variable that controls the gradient of the image. Ideally,
K should be selected to reflect the gradient over the whole image or the neighboring
gradient values of each pixel. The time variable t controls the amount of
diffusion and acts as the scale variable of the Gaussian blurring operation.
If we set a larger value for t, the image will be more diffused. The nonlinear
anisotropic diffusion algorithm has advantages in two aspects. First, it can reduce
noise effectively while preserving the edge information, compared to other noise
filtering algorithms such as median filtering and Gaussian filtering. Second, the
popular watershed algorithm may result in oversegmentation if noise in the input
image is not properly removed. However, we do not have the oversegmentation problem
with the nonlinear anisotropic diffusion algorithm.
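
The diffusion step can be illustrated with the standard explicit four-neighbor discretization of Perona and Malik. This is a sketch under stated assumptions, not the authors' implementation: the conductance uses the quadratic form 1 / (1 + (d / K)^2), and the parameter names and default values (`iterations`, `kappa`, `step`) are hypothetical.

```python
import numpy as np

def perona_malik(image, iterations=10, kappa=15.0, step=0.2):
    """Explicit Perona-Malik diffusion on a 2-D gray-level image.

    kappa plays the role of the conductance variable K, and
    iterations * step plays the role of the diffusion time t.
    step should not exceed 0.25 for the 4-neighbor scheme to stay stable.
    """
    img = image.astype(np.float64).copy()
    for _ in range(iterations):
        padded = np.pad(img, 1, mode="edge")   # replicate border pixels
        north = padded[:-2, 1:-1] - img
        south = padded[2:, 1:-1] - img
        west = padded[1:-1, :-2] - img
        east = padded[1:-1, 2:] - img
        # Conductance is close to 1 in flat areas and small across strong edges.
        flux = sum(d / (1.0 + (d / kappa) ** 2) for d in (north, south, west, east))
        img += step * flux
    return img
```

Differences much smaller than `kappa` are smoothed almost like linear diffusion, while differences much larger than `kappa` receive a small conductance, which is why edges survive the filtering.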

In general, an edge-based segmentation algorithm needs a robust edge linking
operation due to edge discontinuity. However, the watershed transform does
not need any edge linking operation, since regions are defined by closed curves.
The watershed transform can be implemented by a rain-falling or a hill-climbing
operation. In this paper, we employ the rain-falling method, which consists of the
following two steps. First, we define a threshold value. If one pixel has a smaller
value than its neighboring pixels, it is considered to belong to the same region.
In the next step, the remaining pixels are merged into the neighboring pixels that
have the steepest slope. This is analogous to water on a topographic surface
flowing downhill.

In order to prevent oversegmentation by the watershed transform, we need
to apply a postprocessing algorithm over the segmented regions. In this paper, we
employ the k-means algorithm, and the average intensity value of each segmented
region is used as the measure for merging.
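
The merging step can be sketched as a one-dimensional k-means over the region means. The sketch assumes a label image produced by the watershed step and clusters regions purely by average intensity, ignoring adjacency; the function name, the deterministic initialization, and the default `k` are hypothetical choices made for illustration.

```python
import numpy as np

def merge_regions_by_intensity(labels, intensity, k=3, iterations=20):
    """Relabel watershed regions by k-means on their mean intensity.

    labels:    integer label image from the oversegmented watershed result.
    intensity: gray-level image of the same shape.
    Returns an image whose values are cluster indices in [0, k).
    """
    region_ids = np.unique(labels)             # sorted region labels
    means = np.array([intensity[labels == r].mean() for r in region_ids])
    # Deterministic start: spread the centres over the range of region means.
    centres = np.linspace(means.min(), means.max(), k)
    for _ in range(iterations):
        assign = np.argmin(np.abs(means[:, None] - centres[None, :]), axis=1)
        for c in range(k):
            if np.any(assign == c):
                centres[c] = means[assign == c].mean()
    # Map every pixel's region label to the cluster of that region.
    return assign[np.searchsorted(region_ids, labels)]
```

With three clusters, the groups could correspond to background, cytoplasm, and nucleus, although that correspondence is an assumption and would need to be checked on real images.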

We compare the proposed method with the nonparametric clustering algorithm
that was originally proposed for leukemia diagnosis [1]. As shown in Figure
4(a), the nonparametric algorithm does not segment the input image properly
into the nucleus and cytoplasm. Although the region merging algorithm affects
the result, the proposed method produces segments that agree with intuition.
Figure 4(a) and Figure 4(b) show the segmentation results of white-blood cells by
the nonparametric clustering method and the proposed method, respectively.
Figure 5 shows another segmentation result by the proposed method.





(a) Nonparametric method (b) Proposed method

Fig. 4. Comparison of segmentation results

Fig. 5. Segmentation of white-blood cells



2.3 Feature Extraction 



Once we label red-blood cells in the input image, we extract image features from
each red-blood cell. The red-blood cells are then classified in two steps. Features
extracted in the first step are different from those extracted in the second step.

In the first step, since normal, spherocyte, target and stomatocyte cells have
a circular shape, we assume that they belong to the same class, and the contour
information of each cell is used for classification. In the second step, we use all
edge information, including interior edges as well as contour information. In order
to extract image features, we employ the Universidade Nova de Lisboa (UNL) Fourier
transform [11], an improved extension of the Fourier descriptor that handles open
curves and lines.

We obtain image features as follows. The input image consisting of binary 
curve patterns is transformed from the Cartesian coordinate system to the polar 
coordinate system by the UNL transform. After an analytic curve equation is 
estimated, the transformed curve is instantiated in the polar coordinate system. 

Let $\Omega(t)$ be a discrete object composed of n pixels $z_i = (x_i, y_i)$, let
$O = (O_x, O_y)$ be the centroid of the object, and let M be the maximum Euclidean
distance from the centroid O to all pixels $z_i$. A discrete object $U(\Omega(t))$
consists of $\{U(z_{ij}(t))\}$, where $z_{ij}(t)$ is the set of line segments between
two neighboring pixels $z_i = (x_i, y_i)$ and $z_j = (x_j, y_j)$.

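The UNL transform of the discrete object is defined by the mapping from
the Cartesian to the polar coordinate system:

\[ U(z_{ij}(t)) = \left( \frac{\| z_i + t(z_j - z_i) - O \|}{M},\; \arctan\left( \frac{y_i + t(y_j - y_i) - O_y}{x_i + t(x_j - x_i) - O_x} \right) \right) \qquad (4) \]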



Let $i(x, y)$ be a two-dimensional image that represents a discrete object $\Omega(t)$
and $f(R, \theta)$ be the two-dimensional image that represents the UNL transform of
$\Omega(t)$. The discrete UNL Fourier features of the object $\Omega(t)$ are the
normalized discrete Fourier spectrum $UFF(u', v')$ of the image $f(R, \theta)$,
ignoring $F(0,0)$ and the values that are duplicated by conjugate symmetry. In this
paper, the dimension of extracted features is 76. Figure 6 shows
the process to extract edge information of red-blood cells as a preprocessing step
for feature extraction.
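
The polar mapping and the Fourier step can be sketched as follows. This is a simplified illustration that maps only the edge pixels themselves (not the interpolated line segments), uses `arctan2` to obtain the full angular range, and normalizes the spectrum by its zero-frequency magnitude. The normalization, the grid size, and the function name are assumptions, not details confirmed by the paper.

```python
import numpy as np

def unl_fourier_features(points, grid=(16, 16)):
    """Map edge pixels to normalized polar coordinates and return the
    normalized 2-D Fourier magnitude spectrum of the resulting (R, theta) image.

    points: sequence of (x, y) edge-pixel coordinates.
    grid:   hypothetical resolution of the polar image.
    """
    pts = np.asarray(points, dtype=np.float64)
    rel = pts - pts.mean(axis=0)                 # coordinates relative to centroid O
    radius = np.hypot(rel[:, 0], rel[:, 1])
    m = radius.max()                             # maximum distance M
    if m == 0.0:
        raise ValueError("object must contain at least two distinct points")
    r = radius / m                               # R in [0, 1]
    theta = np.arctan2(rel[:, 1], rel[:, 0])     # angle in [-pi, pi]
    polar = np.zeros(grid)
    rows = np.minimum((r * grid[0]).astype(int), grid[0] - 1)
    cols = np.minimum(((theta + np.pi) / (2.0 * np.pi) * grid[1]).astype(int),
                      grid[1] - 1)
    polar[rows, cols] = 1.0
    spectrum = np.abs(np.fft.fft2(polar))
    return spectrum / spectrum[0, 0]             # assumed normalization by |F(0,0)|
```

Dividing the radius by M makes the representation independent of object size, and working relative to the centroid removes the dependence on position. A feature vector would then be read from this spectrum, skipping the zero-frequency term and the conjugate-symmetric half.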

(b) Target cell



Fig. 6. Feature extraction process of red-blood cells 



We have also tested various features of white-blood cells. We categorize the
set of features into three groups. Basic features, such as color, size, intensity,
and the ratio of the nucleus to the cytoplasm, are included in the first group. The
second group contains circularity, eccentricity, elongatedness, convexity, and
invariant moment features for the shape of the nucleus. Texture features of the
nucleus and cytoplasm are included in the third group. From all three groups, we
choose the best 60 features.

In order to construct the classifier efficiently, we can extract image features by
filtering and wrapping [11]. In this paper, we use principal component analysis
(PCA), one of the popular filtering methods, to extract lower-dimensional
features by analyzing multi-dimensional features statistically [12]. For red-blood
cells, we reduce the extracted 76-dimensional features to 38-dimensional features
in the first recognition step by applying PCA. In the second step, we reduce
the initial 76-dimensional features to 67-dimensional features. For white-blood
cells, we reduce the 60-dimensional features to 52-dimensional features. Finally,
each feature value is normalized to a number between 0 and 1.
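
The reduction and normalization can be sketched with a plain SVD-based PCA followed by min-max scaling. The paper does not state how the target dimensions were selected or how the normalization was computed, so the per-feature min-max scaling and the function names below are assumptions.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project feature vectors (one per row) onto the leading principal axes."""
    x = np.asarray(features, dtype=np.float64)
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def minmax_normalize(features):
    """Scale every feature (column) to the range [0, 1]."""
    x = np.asarray(features, dtype=np.float64)
    lo, hi = x.min(axis=0), x.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant columns
    return (x - lo) / span
```

For example, `minmax_normalize(pca_reduce(x, 38))` would correspond to the first recognition step for red-blood cells, with `x` holding the 76-dimensional UNL Fourier features.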

2.4 Classification 

In this section, we introduce a neural network classifier based on the
back-propagation learning algorithm and compare the performance of the
classification model with the k-nearest neighbor (K-NN) and the learning vector
quantization-3 (LVQ-3) algorithms. While K-NN is one of the statistical pattern
classification methods, LVQ-3 is one of the clustering algorithms.

Our classifier is a hierarchical neural network model that classifies red-blood
and white-blood cells using the back-propagation learning algorithm [2].
Classification of red-blood cells consists of two steps. We assume that normal, target,
spherocyte and stomatocyte cells, which have a circular contour, are included in the
same class in the first step. Therefore, each input cell is classified into one of 12
classes. If a cell has a circular contour, it is classified into one of 4 classes in
the second step. The classifier for white-blood cells has the same architecture as
the one for red-blood cells, but it has different parameters.

Figure 7 shows a classifier consisting of two neural networks connected in 
cascade. Another neural network can be added to classify white-blood cells. 
Table 1 lists parameter values for the three neural networks (NN). 



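[Figure: two neural networks connected in cascade. The first network, with a hidden
layer, outputs shape classes such as Circular, Crenated, and Triangular; circular
cells are passed to the second network, which outputs Normal, Target, Spherocyte,
or Stomatocyte.]

Fig. 7. Neural network architecture to classify red-blood cells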



The back-propagation learning algorithm is a generalized delta rule that controls
the weighting factor by the following equation.

\[ W(\text{new})_{ij} = W(\text{old})_{ij} + \alpha\,\delta_j\,a_i \qquad (5) \]

where $i, j$: a neuron in the hidden and output layer, respectively,
$W(\text{new})_{ij}$: modified weight between neuron i and neuron j,
$W(\text{old})_{ij}$: previous weight between neuron i and neuron j,
$e_j = t_j - a_j$: neuron error in the output layer,
$e_j = \sum_k W_{jk}\,\delta_k$: neuron error in the hidden layer,
$\delta_j = a_j (1 - a_j)\, e_j$: delta of neuron j,
$\alpha$: learning rate,
$a_i$: activation value of neuron i,
$a_j$: activation value of neuron j,
$e_j$: error of neuron j,
$t_j$: value of the target pattern if neuron j is in the output layer,
$W_{jk}$: weight of neuron k in the previous layer if neuron j is in the hidden layer,
$\delta_k$: delta of neuron k in the previous layer if neuron j is in the hidden layer.

We use an adaptive learning algorithm to reduce the learning time of the
neural network and to find the local minima. Let $W_{ij}(\text{old})$ be the current
weight and $W_{ij}(\text{older})$ be the previous weight. The current momentum is the
difference between the current weight and the previous weight, that is,
$\Delta W_{ij}(\text{old}) = W_{ij}(\text{old}) - W_{ij}(\text{older})$. Therefore, the
generalized delta rule can be modified as

\[ W(\text{new})_{ij} = W(\text{old})_{ij} + \alpha\,\delta_j\,a_i + \beta\,\Delta W_{ij}(\text{old}) \qquad (6) \]

where $\beta$ is a constant that controls the momentum.
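
One update of Eq. (6) for the weights between the hidden and output layers can be sketched as below. The default learning rate of 0.5 and momentum of 0.9 are the values listed in Table 1; the function name, the array layout, and the use of a sigmoid derivative a(1 - a) follow the definitions under Eq. (5) but are otherwise illustrative assumptions.

```python
import numpy as np

def output_layer_update(weights, prev_delta_w, hidden_act, output_act, target,
                        lr=0.5, momentum=0.9):
    """Apply one momentum update to the hidden-to-output weights.

    weights, prev_delta_w: arrays of shape (n_hidden, n_output).
    hidden_act: activations a_i of the hidden neurons.
    output_act: activations a_j of the output neurons.
    target:     target values t_j.
    Returns (new_weights, delta_w); delta_w is passed back in on the next call.
    """
    error = target - output_act                        # e_j = t_j - a_j
    delta = output_act * (1.0 - output_act) * error    # delta_j = a_j (1 - a_j) e_j
    delta_w = lr * np.outer(hidden_act, delta) + momentum * prev_delta_w
    return weights + delta_w, delta_w
```

The returned `delta_w` equals the difference between the new and the old weights, so feeding it back as `prev_delta_w` reproduces the momentum term of Eq. (6) on the following step.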



Table 1. Parameter values for neural networks

Classifier | Slope of activation function | Learning constant | Nodes for input layer | Hidden layers | Nodes for hidden layer | Nodes for output layer | Momentum constant
NN 1       | 0.1                          | 0.5               | 38                    | 2             | 125                    | 12                     | 0.9
NN 2       | 0.1                          | 0.5               | 38                    | 1             | 120                    | 4                      | 0.9
NN 3       | 0.1                          | 0.5               | 52                    | 1             | 80                     | 5                      | 0.9




3 Experimental Results 

In our experiment, we have used Wright-dyed blood images collected from two
hundred patients in the hospital. In order to train the classifier, we use 680 test
cells for fifteen classes of red-blood cells, and 70 monocytes, 50 basophils, 120
neutrophils, 50 eosinophils, and 120 lymphocytes for five classes of white-blood
cells. The data set is verified by a human expert. We select the leave-one-out
method for the test and apply PCA to reduce the feature dimension. The original
feature dimension of 76 has been reduced to 38 in the first recognition step and to
67 in the second one for the red-blood cells. The initial 60-dimensional features for
white-blood cells are reduced to 52 dimensions.
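
The leave-one-out protocol can be illustrated with the K-NN baseline, which needs no training phase. This is a generic sketch using Euclidean distance and majority voting; the value of `k`, the distance measure, and the function name are assumptions, since the paper does not report them.

```python
import numpy as np

def leave_one_out_knn_accuracy(features, labels, k=3):
    """Classify each sample by its k nearest neighbors among all other samples
    and return the fraction classified correctly."""
    x = np.asarray(features, dtype=np.float64)
    y = np.asarray(labels)
    correct = 0
    for i in range(len(x)):
        dist = np.linalg.norm(x - x[i], axis=1)
        dist[i] = np.inf                       # hold out the sample being tested
        nearest = np.argsort(dist)[:k]
        values, counts = np.unique(y[nearest], return_counts=True)
        if values[np.argmax(counts)] == y[i]:
            correct += 1
    return correct / len(x)
```

For a trained classifier such as the neural network, the same loop would retrain the model on all remaining samples before testing the held-out one.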

We have compared our classifier to two other classifiers, K-NN and LVQ-3,
for classifying red-blood cells. Figure 8 shows recognition rates for red-blood
cells in the first and the second steps, respectively. Figure 9 shows the improved
recognition rates acquired with the reduced features after applying PCA. MLP in
Figure 8 and Figure 9 indicates our classifier based on the multilayer perceptron.

After classification, the result is compared with that of a human expert. Our
experimental results show that we can obtain improved recognition rates using
features of reduced dimension. However, the recognition rate is low in the first
step, since it is difficult to distinguish burr cells from crenated cells with nearly
the same contour shape. This problem can be mitigated by more precise contour
extraction. Table 2 and Table 3 list the average recognition rates with the original
features and the reduced features, respectively.

Table 4 presents the final recognition result of white-blood cells as a confusion
matrix with the reduced features. As we can observe in Table 4, most recognition
errors are between monocyte and lymphocyte and between neutrophil and
eosinophil, since they have similar characteristics in the shape, the size and
the color of the nucleus and cytoplasm. Moreover, it is very difficult even for
a human expert to distinguish the monocyte from the abnormal lymphocyte,
because they have the same characteristics in terms of the size, color and ratio
of nucleus and cytoplasm.




(a) First step (b) Second step

Fig. 8. Recognition rate (%) of red-blood cells

(a) First step (b) Second step

Fig. 9. Recognition rate (%) after PCA

Table 2. Average recognition rate (%) before PCA

Recognition step           K-NN    LVQ-3    MLP
First step (12 classes)    75      87       87
Second step (4 classes)    90      91       94



Table 3. Average recognition rate (%) after PCA

Recognition step           K-NN    LVQ-3    MLP
First step (12 classes)    73      78       87
Second step (4 classes)    88      87       94



Table 4. Confusion matrix for white-blood cells

              Monocyte   Basophil   Neutrophil   Eosinophil   Lymphocyte
Monocyte         55          0           2            0           13
Basophil          0         38           2           10            0
Neutrophil       11          0          97            8            4
Eosinophil        0          2           6           42            0
Lymphocyte        8          0           4            0           98
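
Per-class and overall rates can be read off Table 4 with a short calculation. The sketch assumes that rows are the true classes and columns the assigned classes, which the table does not state explicitly. It uses the entries as printed: the Lymphocyte row sums to 110, whereas the text reports 120 lymphocytes, so figures derived here need not reproduce the averages quoted in the paper.

```python
import numpy as np

CLASSES = ["Monocyte", "Basophil", "Neutrophil", "Eosinophil", "Lymphocyte"]

# Entries copied from Table 4 as printed (rows assumed to be true classes).
CONFUSION = np.array([
    [55,  0,  2,  0, 13],
    [ 0, 38,  2, 10,  0],
    [11,  0, 97,  8,  4],
    [ 0,  2,  6, 42,  0],
    [ 8,  0,  4,  0, 98],
])

def per_class_rate(matrix):
    """Fraction of each row that lies on the diagonal."""
    return np.diag(matrix) / matrix.sum(axis=1)

def overall_rate(matrix):
    """Fraction of all cells that lie on the diagonal."""
    return np.trace(matrix) / matrix.sum()

if __name__ == "__main__":
    for name, rate in zip(CLASSES, per_class_rate(CONFUSION)):
        print(f"{name:<11s} {100.0 * rate:5.1f} %")
    print(f"Overall     {100.0 * overall_rate(CONFUSION):5.1f} %")
```

The off-diagonal entries make the confusions discussed in the text directly visible, for example the 13 monocytes assigned to the lymphocyte class.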



4 Conclusions 

In this paper, we have proposed a new scheme to recognize and classify red- 
blood and white-blood cells in the human peripheral blood image. We have 
also described a classification model based on the neural network. We classify 
red-blood cells in two steps using inner edges and contour information, and 
white-blood cells using various features of the nucleus and cytoplasm. We have 
proposed a new algorithm to segment the nucleus and cytoplasm of white-blood 
cells. In addition, we show that the complexity of the neural network can be reduced
and a more efficient system can be constructed by applying PCA to features
extracted from cells. The recognition rates for red-blood and white-blood cells are
91% and 81% on average, respectively.

Abstract. Domain-specific information retrieval normally depends on general
search engines, or systems which support browsing using handcrafted 
organisation of documents, but such systems are costly to build and maintain. 
An alternative approach for specialised domains is to build a retrieval system 
incrementally and dynamically by allowing users to evolve their own 
organisation of documents and to assist them in ensuring improvement of the 
system’s performance as it evolves. This paper describes a browsing 
mechanism for such a system based on the concept lattice of Formal Concept 
Analysis (FCA) in cooperation with incremental knowledge acquisition 
mechanisms. Our experience with a prototype suggests that a browsing scheme
for a specific domain can be collaboratively created and maintained
by multiple users over time. It also appears that the concept lattice of FCA is a
useful way of supporting the flexible open management of documents required 
by individuals, small communities or in specialised domains. 



1 Introduction 

Broadly speaking there are two ways in which a user interacts with document retrieval 
systems. In one the user formulates a specific query and some documents are 
retrieved in response. This process is normally iterative in that the user refines (or 
changes) the query on the basis of the documents retrieved by each query. In the 
second approach the documents are grouped and the document groups organised into 
some sort of structure that can be browsed. That is, from any point in the structure at 
least some other related parts of the structure can be identified and moved to. 

The ideal would be that specific queries would always produce the most relevant 
documents. Despite improvements in this area (e.g. Google), specific queries remain 
very frustrating: the only search terms the user can think of occur in myriad other 
contexts and perhaps even do not occur in some relevant documents. As a result a 
browsing approach is supported in many information retrieval systems. With 
browsing users quickly explore the search domains and can easily acquire domain 
knowledge [18]. Typically, a hierarchy is used for browsing and documents are 
grouped using some sort of clustering algorithms. Hierarchical Agglomerative 
Clustering algorithms are probably the most commonly used. The problem with a
hierarchical clustering is category mismatch [7, 12]. If one goes down the wrong path,
one must go back up the hierarchy and start again. There is no mechanism for
navigating to other clusters, as there is only a simple taxonomy structure. A further
critical issue in a browsing scheme is the origin of the terms by which the documents
are grouped. One can attempt to arrive at some global taxonomy to satisfy all possible
users, as used by sites like Yahoo and the Open Directory Project (http://dmoz.org/).
In these global systems the category mismatch problems can be very severe.

As an alternative, documents can be organised using ontologies for browsing for a 
specific domain. That is, one can build a specialised ontology to be used by a specific 
community, with the assumption that within the community there will be consensus 
on the terms. A good example is the (KA)^2 initiative [2]. (KA)^2 starts out with an
ontology appropriate to the domain with the expectation that people in the community 
will annotate documents according to the ontology. These same users should also be 
able to use the ontology to retrieve documents entered by others and an end-user can 
retrieve relevant documents by navigating an ontological browser formulated in a 
hierarchy. There are likely to be considerable practical advantages to even very large 
communities committing to specific ontologies, and part of education would be to 
learn these ontologies. Despite the practical advantages of a community committing
to an ontology, we have long held the view that at base any knowledge structure is a
construct which should be allowed to evolve over time [6].

Hence rather than committing to an a priori ontology and expecting that all 
documents will be annotated according to the ontology, our aim is to explore the 
possibilities of a system where the user can annotate a document however they like 
and that the ontology will evolve accordingly. Rather than this being totally ad hoc, 
we would like the system to assist the user to make extensions to the ontology that are 
in some way improvements. We are not concerned with automated or semi-automated 
ways of discovering an ontology appropriate to a document or corpus [1, 17]. Despite 
the potential of such approaches, from our more deconstructionist perspective, we are 
more interested in the role of the reader or user interpreting documents and deciding 
on their annotation and development of an ontology. However, this does not preclude 
the inclusion of ontologies either constructed by an expert or an ontology imported 
from elsewhere, as part of the ontological structure preferred by the user. 

An alternative to a hierarchy for browsing is a lattice-based navigation scheme using 
Formal Concept Analysis (FCA). In this approach, a document is annotated by an 
expert with a set of controlled terms. From this a concept lattice is constructed using 
the given mathematical formulae of FCA. The significant advantage of this approach 
is that the mathematical formulae produce a conceptual structure which automatically 
provides generalisation and specialisation relationships among the concept nodes. 
This lattice structure allows one to reach a group of documents via one path, but then 
rather than going back up the same hierarchy and guessing another starting point, one 
can go to one of the other parents of the present node reducing the problem of 
category mismatch. In this paper we describe a system based on the lattice browsing 
supported by FCA but supporting incremental development of the system over time. 
The system assists users in ensuring improvement of the system’s performance as it 
evolves. 

We have previously demonstrated incremental development of document
management systems based on selecting keywords that discriminate between
documents [14, 15] using the Ripple-Down Rule (RDR) knowledge acquisition and
maintenance methodology. The RDR approach was initially developed for knowledge
acquisition for knowledge based systems [6]. Although, as demonstrated in other
RDR work, RDR greatly assists context-specific knowledge acquisition, it does not
organise the knowledge in a way that is suitable for browsing. One of the aims here is
to integrate the RDR incremental approach with the browsing advantage of FCA.
FCA has previously been used with RDR expert systems as an explanation tool [20].

A prototype has been implemented (http://pokey.cse.unsw.edu.au/servlets/Search) 
and demonstrated with a test domain of around 200 papers from the Banff Knowledge 
Acquisition Workshops (http://ksi.cpsc.ucalgary.ca:80/KAW). Another test domain
(http://pokey.cse.unsw.edu.au/servlets/RI) is for research topics in the School of 
Computer Science and Engineering, UNSW. There are around 150 research staff and 
students in the School who generally have homepages indicating their research 
projects. The aim here was to allow staff and students to freely annotate their pages so 
that they would be found appropriately within the evolving lattice of research topics. 
The goal is a system to assist prospective students and potential collaborators in 
finding research relevant to their interests. 



2 System Overview 

Figure 1 shows an overview of the system. A user can annotate his/her own document 
with a set of keywords by selecting keywords already used in the system which have 
been added by others or by entering further textwords which in turn will be available 
to future users. The user is provided with a list of keywords already available. After 
an initial selection, the system indicates keywords that have been used together with 
the keywords already selected for other documents. Through these and further 
knowledge acquisition steps, the initial keywords can be refined. Then the case (a 
document with a set of keywords) is added into the system. After that, the system 
rebuilds the concept lattice to cope with the new case. Figure 2 (a) shows an example 
of a lattice. The concept lattice is a data structure either for indexing documents or for 
browsing. The keywords set is used for the indexing as shown in Figure 2 (b). 
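
To make the lattice construction concrete, the sketch below enumerates the formal concepts of a small set of cases by brute force and finds the direct parents of a node. It is an illustration only: the document names and keywords are invented, the enumeration is exponential in the number of documents, and it rebuilds everything from scratch rather than updating the lattice incrementally as the described system does.

```python
from itertools import combinations

def formal_concepts(cases):
    """Return all (extent, intent) pairs for cases: {document: set of keywords}."""
    docs = list(cases)
    all_keywords = set().union(*cases.values()) if cases else set()

    def intent_of(extent):
        # Keywords shared by every document in the extent.
        if not extent:
            return frozenset(all_keywords)
        return frozenset(set.intersection(*(set(cases[d]) for d in extent)))

    def extent_of(intent):
        # Documents annotated with every keyword in the intent.
        return frozenset(d for d in docs if intent <= cases[d])

    concepts = set()
    for size in range(len(docs) + 1):
        for subset in combinations(docs, size):
            intent = intent_of(subset)
            concepts.add((extent_of(intent), intent))
    return concepts

def parents(concept, concepts):
    """Direct generalisations: smallest concepts whose extent strictly contains ours."""
    extent = concept[0]
    bigger = [c for c in concepts if extent < c[0]]
    return [c for c in bigger if not any(extent < o[0] < c[0] for o in bigger)]

# Hypothetical cases, for illustration only.
EXAMPLE_CASES = {
    "d1": {"fca", "rdr"},
    "d2": {"fca"},
    "d3": {"rdr", "ka"},
}
```

In this example the node for document d1 (keywords fca and rdr) has two parents, the fca node and the rdr node, which is the property that lets a user move sideways in the lattice instead of backing up a single hierarchy.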




Fig. 1. An overview of the system

Fig. 2. Examples of the browsing structure (a) Lattice structure (b) Indexing of the lattice (c)
Nested structure

The concept lattice is incrementally and automatically reformulated whenever a new
case is added or the existing cases are changed. The user can also give values for
properties (attributes) defined for the domain ontology by an expert, or the system can
automatically extract the values of attributes from the content of documents. This
requires a prior domain ontology in the same way as (KA)^2 and is included in our
system only for completeness. We suggest that it be used only for the most obvious
attributes rather than for implementing a fully developed ontology. The attributes are
accessed via nested browsing as shown in Figure 2 (c). The nested structure is
constructed dynamically corresponding to the search results. That is, we build a
concept lattice using the resulting documents with their keywords as an outer
structure and from this produce a nested structure.

The user specifies a query by entering any textwords in a conventional information 
retrieval fashion or by selecting a keyword from those that have been used for 
annotating the documents. For a textword search, a set of words separated by spaces 
is entered and the Boolean AND operator is assumed. Stopwords are first eliminated 
and the remaining query words are stemmed using the stemming classes. If a keyword 
has been selected or the textwords identify some keywords, the system identifies the 
appropriate node and displays it together with its direct neighbours. The user can start 
navigation from this node. If the system does not include a node with the given 
keywords, it displays a sub-lattice which covers only documents that contain the 
textwords anywhere in the document. The user can navigate this sub-lattice, and also 
transfer to the same node in a lattice of all documents as required. If the textwords 
entered did not correspond to a node, the system also sends a log file to an expert so 
s/he can decide if more appropriate keywords are required for the documents. 
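
The query handling described above can be sketched roughly as follows. This is only an illustrative outline, not the system's implementation: the stopword list, the suffix-stripping `stem` function and the names `normalise` and `search` are assumptions introduced here, and the "node" is represented simply by the matched keywords and the documents that carry all of them.

```python
import re

# Hypothetical stopword list; the paper does not give the one it uses.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def stem(word):
    """Crude suffix stripping, standing in for the paper's stemming classes."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalise(text):
    """Lower-case, split into words, drop stopwords and stem what is left."""
    words = re.findall(r"[a-z]+", text.lower())
    return {stem(w) for w in words if w not in STOPWORDS}


def search(query, annotations, full_text):
    """Resolve a query to a lattice node if possible, else fall back to full text.

    annotations: document id -> set of keywords chosen by annotators
    full_text:   document id -> text of the document
    """
    terms = normalise(query)
    all_keywords = set().union(*annotations.values())
    # Keywords all of whose (stemmed) words occur in the query.
    matched = {k for k in all_keywords if normalise(k) <= terms}
    covered = set().union(*(normalise(k) for k in matched))
    # Boolean AND: a document must carry every matched keyword.
    docs = {d for d, kws in annotations.items() if matched <= kws}
    if matched and covered == terms and docs:
        return "node", matched, docs
    # No node for these words: AND-search the document text instead.
    docs = {d for d, text in full_text.items() if terms <= normalise(text)}
    return "sub-lattice", None, docs
```

In the fallback case the returned documents would be the ones from which the sub-lattice is built; logging the failed query for the expert is left out of the sketch.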



3 Formal Concept Analysis for the System 

Formal Concept Analysis (FCA) is a mathematical theory which formulates the 
understanding of a 'concept' as a unit of thought comprising its extension and 
intension, as a way of modelling a domain [11, 22]. The extension of a concept is 
formed by all objects to which the concept applies and the intension consists of all 
attributes existing in those objects. FCA generates a conceptual hierarchy of the 
domain by finding all possible formal concepts which reflect the relationships 
between attributes and objects. The resulting subconcept-superconcept relationships 
between formal concepts are expressed in a concept lattice which can be seen as a 
semantic net providing "hierarchical conceptual clustering of the objects ... and a 
representation of all implications between the attributes" [23]. More detailed 
definitions and examples can be found in [11]. 



3.1 Formal Contexts and Formal Concepts 

The most basic data structure of FCA is a formal context. The set of objects and their 
attributes constitute a formal context $\mathbb{K} = (G, M, I)$. G is a set of objects, 
M is a set of attributes and I is a binary relation between G and M which indicates 
where an object g has an attribute m by the relationship $gIm$ (also written 
$(g, m) \in I$). In the original formulation of FCA, objects were implicitly assumed to 
have some sort of unity or identity so that the attributes applied to the whole object; 
e.g. a dog has four legs. Clearly documents do not have the sort of unity where 
attributes will necessarily apply to the whole document. However, at this stage of this 
work we suppose that documents correspond to objects and the keywords or terms 
attached to documents by a user constitute attribute sets. We define a formal context 
(C) for our document retrieval system as follows. 

Definition 1: A formal context is a triple C = (D, K, I) where D is a set of documents, 
K is a set of keywords and I is a binary relation which indicates where a document d 
has a keyword k by the relationship $dIk$ (also written $(d, k) \in I$). 

For example, Table 1 shows the formal context C where D is {1, 2, 3, 4}, K is 
{artificial intelligence, information retrieval, machine learning, decision tree, natural 
language processing, speech recognition, signal representation} and the relation I is 
{(1, artificial intelligence), (1, information retrieval), ..., (4, artificial intelligence), 
(4, natural language processing), (4, speech recognition), (4, signal representation)}. 

Then, formal concepts are derived from the formal context using the basic derivation 
operators 

$X \subseteq D: \; X \mapsto X' := \{k \in K \mid \forall d \in X: (d, k) \in I\}$, 
$Y \subseteq K: \; Y \mapsto Y' := \{d \in D \mid \forall k \in Y: (d, k) \in I\}$. 

A formal concept is defined as a pair (X, Y) such that $X \subseteq D$, 
$Y \subseteq K$, $X' = Y$ and $Y' = X$, where X and Y are called the extent and the 
intent of the concept (X, Y). More detailed mathematical formulae and procedures can 
be found in [11, 22]. A node in Figure 3 represents a formal concept. 
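
As a small illustration of the two derivation operators, the sketch below enumerates the formal concepts of the context of Table 1 by brute force. The dictionary `CONTEXT` is our reading of Table 1, and the function names are chosen here for illustration; closing every subset of documents is only practical for very small contexts and is not the algorithm used by the system.

```python
from itertools import combinations

# Documents 1-4 and their keywords, as read from Table 1.
CONTEXT = {
    1: {"artificial intelligence", "information retrieval"},
    2: {"artificial intelligence", "machine learning", "decision tree"},
    3: {"artificial intelligence", "information retrieval",
        "natural language processing"},
    4: {"artificial intelligence", "natural language processing",
        "speech recognition", "signal representation"},
}
KEYWORDS = set().union(*CONTEXT.values())


def intent_of(docs):
    """X' : the keywords shared by every document in X (all of K if X is empty)."""
    shared = set(KEYWORDS)
    for d in docs:
        shared &= CONTEXT[d]
    return frozenset(shared)


def extent_of(keywords):
    """Y' : the documents that carry every keyword in Y."""
    return frozenset(d for d, ks in CONTEXT.items() if set(keywords) <= ks)


def formal_concepts():
    """Close every subset of documents; each closure (X'', X') is a formal concept."""
    concepts = set()
    docs = sorted(CONTEXT)
    for r in range(len(docs) + 1):
        for subset in combinations(docs, r):
            y = intent_of(subset)
            concepts.add((extent_of(y), y))
    return concepts
```

For this context the enumeration yields seven concepts, among them the pair with extent {3, 4} and intent {artificial intelligence, natural language processing}.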



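Table 1. Part of the formal context in our application 

    | Artificial   | Information | Machine  | Decision | Natural language | Speech      | Signal
    | intelligence | retrieval   | learning | tree     | processing       | recognition | representation
----+--------------+-------------+----------+----------+------------------+-------------+---------------
 1  |      X       |      X      |          |          |                  |             |
 2  |      X       |             |    X     |    X     |                  |             |
 3  |      X       |      X      |          |          |        X         |             |
 4  |      X       |             |          |          |        X         |      X      |       X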




M. Kim and P. Compton 




Fig. 3. Concept lattice of the formal context in Table 1 



3.2 Concept Lattice 

The formal concepts of C are expressed in a concept lattice $\mathcal{L}(D, K, I)$, 
which is the conceptual structure of FCA and is ordered by the smallest set of 
attributes. The structure is reformulated incrementally and automatically by adding a 
new case and refining the existing cases. To build a concept lattice we need to find the 
subconcept-superconcept relationship between the formal concepts. This is 
formalised by 

$(X_1, Y_1) \le (X_2, Y_2) \; :\Leftrightarrow \; X_1 \subseteq X_2 \; (\Leftrightarrow Y_2 \subseteq Y_1)$ 

where $(X_1, Y_1)$ is called a subconcept of $(X_2, Y_2)$ and $(X_2, Y_2)$ is called a 
superconcept of $(X_1, Y_1)$. Figure 3 shows the concept lattice of the formal context 
C in Table 1. 
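
The ordering can be made concrete with a short sketch. It assumes the context of Table 1 as we read it; `concepts_from`, `is_subconcept` and `direct_superconcepts` are names introduced here. The edges of a line diagram such as Figure 3 correspond to the direct (upper-neighbour) relation computed by the last function.

```python
# Documents 1-4 and their keywords, as read from Table 1.
CONTEXT = {
    1: {"artificial intelligence", "information retrieval"},
    2: {"artificial intelligence", "machine learning", "decision tree"},
    3: {"artificial intelligence", "information retrieval",
        "natural language processing"},
    4: {"artificial intelligence", "natural language processing",
        "speech recognition", "signal representation"},
}


def concepts_from(context):
    """All (extent, intent) pairs of a context.

    Intents are exactly the intersections of document keyword sets,
    plus the full keyword set (the intent of the empty extent).
    """
    all_keywords = frozenset().union(*context.values())
    intents = {all_keywords}
    for kws in context.values():
        intents |= {frozenset(kws) & y for y in intents}
    return [
        (frozenset(d for d, ks in context.items() if y <= ks), y)
        for y in intents
    ]


def is_subconcept(c1, c2):
    """(X1, Y1) <= (X2, Y2) iff X1 is a subset of X2 (equivalently Y2 of Y1)."""
    return c1[0] <= c2[0]


def direct_superconcepts(c, concepts):
    """Strictly greater concepts with no other concept in between."""
    above = [u for u in concepts if c[0] < u[0]]
    return [u for u in above if not any(c[0] < v[0] < u[0] for v in above)]
```

For example, the concept whose extent is {3} has two direct superconcepts under this reading of Table 1: the one with extent {1, 3} (information retrieval) and the one with extent {3, 4} (natural language processing).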



3.3 Conceptual Scaling 

Conceptual scaling has been introduced in order to deal with many-valued attributes 
[10]. A many-valued context is defined as a formal context $\mathbb{K} = (G, M, W, I)$ 
where G is a set of objects, M is a set of attributes and W is a set of attribute values. 
I is a ternary relation between G, M and W which indicates where an object g has the 
attribute value w for the attribute m. Then, if a user is interested in analysing the 
interrelationship between attributes, he/she can choose the required attribute(s) from 
the many-valued context and build a formal context for the attribute(s). This process 
is called conceptual scaling. A concept lattice is built for each of the separate formal 
contexts, and a combined lattice is derived by combining several concept lattices into 
'nested line diagrams' (e.g. TOSCANA) or a new form of lattice structure. Table 2 
is an example of a many-valued context in our domain. We build a concept lattice 
from a set of documents with their keywords as an outer structure. Then, we scale up 
using other attributes into a nested structure. The nested structure is constructed 
dynamically in response to the outer structure. Conceptual scaling is also applied to 
one-valued contexts in order to reduce the complexity of the visualisation [21]. In our 
present system, an expert can group relevant attribute values from the formal context 
C = (D, K, I) of Definition 1. This process is incorporated with building a thesaurus, 
which is not addressed here as it is beyond the scope of this paper. Then, when a 
query is associated with the thesaurus, conceptual scales are derived on the fly to 
group the relevant terms as the nested attributes. 

Table 2. An example table for many-valued contexts 

           | Keywords       | Authors    | Proceeding title | Publication year
-----------+----------------+------------+------------------+-----------------
Document 1 | k1, k2, k3     | a1, a2     | KAW              | 1998
Document 2 | k1, k3, k4     | a3, a2     | EKAW             | 1999
Document 3 | k1, k2, k5, k6 | a4, a5, a6 | PKAW             | 2000
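
To make the scaling step more tangible, the following sketch derives a one-valued context from a single many-valued attribute and groups the documents found at an outer keyword node by the values of that attribute. The data are the rows of Table 2 as we read them; the field names, the use of a simple nominal scale ("attribute=value") and the functions `nominal_scale` and `nested_view` are assumptions made for this illustration, not the system's actual scales.

```python
# Rows of Table 2; k1..k6 and a1..a6 are the table's own placeholders.
MANY_VALUED = {
    "Document 1": {"keywords": {"k1", "k2", "k3"}, "authors": {"a1", "a2"},
                   "proceedings": "KAW", "year": 1998},
    "Document 2": {"keywords": {"k1", "k3", "k4"}, "authors": {"a3", "a2"},
                   "proceedings": "EKAW", "year": 1999},
    "Document 3": {"keywords": {"k1", "k2", "k5", "k6"},
                   "authors": {"a4", "a5", "a6"},
                   "proceedings": "PKAW", "year": 2000},
}


def nominal_scale(rows, attribute, docs):
    """Turn one many-valued attribute into one-valued 'attribute=value' features."""
    return {doc: {f"{attribute}={rows[doc][attribute]}"} for doc in docs}


def nested_view(rows, selected_keywords, attribute):
    """Outer structure: documents carrying all selected keywords.

    Inner (nested) structure: those documents grouped by the value
    of the chosen attribute.
    """
    docs = sorted(
        doc for doc, row in rows.items()
        if set(selected_keywords) <= row["keywords"]
    )
    groups = {}
    for doc in docs:
        groups.setdefault(rows[doc][attribute], []).append(doc)
    return groups
```

Calling `nested_view(MANY_VALUED, {"k1", "k2"}, "year")` first restricts attention to the documents at the outer node for k1 and k2 and then splits them by publication year, which mirrors the idea of building the nested structure in response to the outer one.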













3.4 Incremental Construction of the Concept Lattice 

Many different algorithms exist for generating a concept lattice from a given formal 
context [3, 9, 13]. However, we have developed a further incremental algorithm to 
construct the concept lattice. In our approach, the concept lattice is incrementally 
changed by adding a new case and refining the existing cases. The following is a 
brief explanation of the algorithm. 

Assume the existing formal context C = (D, K, I) where D is a set of documents, K 
is a set of keywords and I is a binary relation between D and K. Let $\mathcal{B}(C)$ 
be the set of all formal concepts of the formal context C. A formal concept of the 
context C consists of a pair (X, Y) where X and Y are called the extent and the intent 
respectively. Now let $X^{\alpha}$ be the set of all extents and $Y^{\alpha}$ the set of 
all intents of $\mathcal{B}(C)$. In adding a new document, let d be the new document 
and $\Gamma$ be the set of keywords of d. Then an extended formal context of C is 
defined as $C^* = (D^*, K^*, I^*)$ where $D^* = D \cup \{d\}$, $K^* = K \cup \Gamma$ 
and $I^* = I \cup \{(d, k) \mid k \in \Gamma\}$. In the case of refining an existing case, 
$D^* = D$ and $K^* = (K - \Gamma^{o}) \cup \Gamma^{n}$ where $\Gamma^{o}$ is the set 
of keywords associated with the document from among the existing keywords and 
$\Gamma^{n}$ is the new set of keywords for this document. 

Then the following procedure is applied for each element k of $\Gamma$. The system 
formulates a formal concept (X, Y) where X is the set of documents associated with 
the element k and Y is {k}, and determines the intersection of X with each extent in 
$X^{\alpha}$ of $\mathcal{B}(C)$. If the intersection does not exist in $X^{\alpha}$, the 
system reformulates the formal concept (X, Y) where X is the intersection and Y is 
{k}, and adds the concept into $\mathcal{L}(C)$. After this process, the extended set of 
all formal concepts $\mathcal{B}^*(C)$ is composed. But $\mathcal{B}^*(C)$ can include 
concepts with a common attribute component, contrary to FCA. We eliminate the 
formal concepts sharing a common attribute component, except for the maximal 
concept of $\mathcal{B}^*(C)$, i.e. the one with the largest object component. For 
reference, the object components of concepts with a common attribute component are 
in a total subsumption relationship. Then, subconcept and superconcept relationships 
are reformulated for all formal concepts which include the keywords $\Gamma$ of d. 
This results in a new lattice $\mathcal{L}(D^*, K^*, I^*)$ of the context $C^*$. 
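
The text above leaves several details of the incremental algorithm open, so the sketch below does not attempt to reproduce it. Instead it shows a standard intersection-based update that serves the same purpose of avoiding a rebuild from scratch: because intents are closed under intersection, adding one document only requires intersecting its keyword set with the intents already known. The function names and the restriction to intents of concepts with a non-empty extent are choices made for this illustration.

```python
from itertools import combinations


def add_document(intents, new_keywords):
    """Update the intents of all concepts with a non-empty extent.

    intents:      set of frozensets (intents known before the addition)
    new_keywords: keywords of the document being added
    """
    gamma = frozenset(new_keywords)
    return intents | {y & gamma for y in intents} | {gamma}


def extent(intent, context):
    """Documents carrying every keyword of the intent."""
    return frozenset(d for d, ks in context.items() if intent <= ks)


def brute_force_intents(context):
    """Reference implementation: intersect every non-empty group of documents."""
    docs = list(context)
    found = set()
    for r in range(1, len(docs) + 1):
        for group in combinations(docs, r):
            found.add(frozenset.intersection(
                *(frozenset(context[d]) for d in group)))
    return found
```

Starting from an empty set of intents and calling `add_document` once per document gives the same intents as `brute_force_intents`; the bottom concept, whose intent is the whole keyword set when no document carries every keyword, has to be added separately.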



4 Incremental Knowledge Acquisition Mechanisms 

Knowledge acquisition is carried out when a new document is added with a set of 
keywords or the keywords of existing documents are refined. When an expert/user 
assigns the set of keywords for a document, some keywords may be missed. The 
system guides the user to discover possible missed concepts through a number of 
steps. The knowledge acquisition mechanisms are based on FCA and RDR 
techniques. The following definitions are used. 

Definition 2: Let C = (D, K, I) be a formal context, d be a new document 
($d \notin D$) and $\Gamma$ be the set of keywords of d. The set of keywords is not 
necessarily a subset of K. Then, the extended formal context of C is defined as 
$C^* = (D^*, K^*, I^*)$ where $D^* = D \cup \{d\}$, $K^* = K \cup \Gamma$ and 
$I^* = I \cup \{(d, k) \mid k \in \Gamma\}$. 

Definition 3: Let C = (D, K, I) be a formal context and $\Gamma$ be a set of keywords 
($\Gamma \subseteq K$). Then the set of documents associated with $\Gamma$ is defined 
to be $\Delta_\Gamma = \{d \in D \mid \exists k \in \Gamma \text{ such that } (d, k) \in I\}$. 

We introduce $\Delta_\Gamma$ to obtain the set of documents which have at least one 
keyword of $\Gamma$. If $\Gamma$ is a singleton (i.e. $\Gamma = \{\gamma\}$), then we 
will abbreviate $\Delta_\gamma = \Delta_\Gamma = \{d \in D \mid (d, \gamma) \in I\}$. 

Definition 4: Let C = (D, K, I) be a formal context. We define a function f from D to 
$2^K$ as $f: D \to 2^K$ such that $f(d) = \{k \in K \mid (d, k) \in I\}$. 

That is, f(d) returns the set of keywords of d. Let the new document be d 
($d \notin D$) with the set of keywords $\Gamma$. We formulate the sub-formal context 
$C' = (D', K', I')$ with $D' = \Delta_\Gamma \cup \{d\}$, where $\Delta_\Gamma$ is as in 
Definition 3, and $K' = \bigcup_{\delta \in D'} f(\delta)$, where f is the function in 
Definition 4. In order to get a set of relevant keywords of d, we obtain the set of 
keywords which are associated with $\Delta_\Gamma$ as 
$f(\Delta_\Gamma) = \bigcup_{\delta \in \Delta_\Gamma} f(\delta)$ from the context C'. 
Now the set of relevant keywords is defined as 
$\mathfrak{R} = f(\Delta_\Gamma) - \Gamma$. Then, the function Freq introduced below 
is used for each keyword k of $\mathfrak{R}$ to compute the number of keywords that 
$\Gamma$ has in common with the keywords of all the documents that have the 
keyword k in the context C'. 

Definition 5: We define a function Freq from $2^K \times K$ to the set of natural 
numbers N as follows: $Freq: 2^K \times K \to N$ such that 
$Freq(\Gamma, k) = \sum_{\delta \in \Delta_k} |f(\delta) \cap \Gamma|$ where $|X|$ is 
the cardinality of X. 
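
Definitions 3 to 5 can be exercised on a toy example. The sketch below follows our reading of the definitions (in particular, the sum in Freq is taken over the documents that carry the keyword k); the existing context is Table 1 as we read it, and the function names and the example keyword set of the new document are hypothetical.

```python
# Existing documents and their keywords, as read from Table 1.
CONTEXT = {
    1: {"artificial intelligence", "information retrieval"},
    2: {"artificial intelligence", "machine learning", "decision tree"},
    3: {"artificial intelligence", "information retrieval",
        "natural language processing"},
    4: {"artificial intelligence", "natural language processing",
        "speech recognition", "signal representation"},
}


def docs_with_any(context, keywords):
    """Definition 3: documents having at least one of the given keywords."""
    return {d for d, ks in context.items() if ks & set(keywords)}


def relevant_keywords(context, gamma):
    """Keywords of the associated documents that the new document lacks."""
    related = set()
    for d in docs_with_any(context, gamma):
        related |= context[d]          # f(d), Definition 4
    return related - set(gamma)


def freq(context, gamma, k):
    """Definition 5: overlap with gamma, summed over documents carrying k."""
    return sum(len(context[d] & set(gamma))
               for d in docs_with_any(context, {k}))


def suggestions(context, gamma):
    """Relevant keywords with their weights, heaviest first."""
    weighted = [(k, freq(context, gamma, k))
                for k in relevant_keywords(context, gamma)]
    return sorted(weighted, key=lambda kw: (-kw[1], kw[0]))
```

For a new document annotated with information retrieval and natural language processing, the candidates are artificial intelligence (weight 4), signal representation (weight 1) and speech recognition (weight 1), so artificial intelligence would be offered to the user first.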

The user can annotate his/her document with a set of keywords by entering any 
terms or selecting known terms. The system displays all the keywords used by other 
annotators so that they can be shared and reused. After this initial assignment, the 
user can view the other terms that co-occur with the terms s/he has provided and can 
annotate the document with these further terms if desired. The terms are presented to 
the user ordered by their frequency in the lattice, normalised for the number of terms 
at the node, and by their 'closeness' in the conceptual hierarchy to the node to which 
the document is assigned by the user's initial choice of terms. 

In more detail, an ordered set of documents and a set of keywords which are relevant 
to the new document are obtained. A sub-lattice $\mathcal{L}'(D', K', I')$ of the formal 
context C' described above is then constructed. This step is divided into two stages. 
In the first stage, the ordered documents are shown to the user along with the features 
that differ between the new document and each of these documents. Given a new 
document d, we are interested in finding the set of documents $D_j$ that share some 
commonalities with it. We formulate a formal concept $\xi = (\{d\}, f(d))$ with the 
newly added document d and its set of keywords $\Gamma$. Starting from the concept 
$\xi$ we recursively go up to the direct superconcepts in the lattice to find the next 
level of relevant documents. This procedure is repeated until the superconcept 
reaches the top node of the lattice. 
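
The upward traversal of the first stage might be sketched as below. This is an interpretation rather than the system's code: the new document is assumed to be already part of the context, its starting node is taken to be the concept generated by its keywords, and the names `concepts_from` and `related_by_level` are introduced here. Documents are reported level by level, closest first.

```python
def concepts_from(context):
    """All (extent, intent) pairs; intents are intersections of keyword sets."""
    all_keywords = frozenset().union(*context.values())
    intents = {all_keywords}
    for kws in context.values():
        intents |= {frozenset(kws) & y for y in intents}
    return [
        (frozenset(d for d, ks in context.items() if y <= ks), y)
        for y in intents
    ]


def related_by_level(context, new_doc):
    """Climb from the new document's concept to the top of the lattice.

    Returns a list of levels; each level lists the documents first
    reached at that distance from the new document's node.
    """
    extents = [x for x, _ in concepts_from(context)]
    # Extent of the concept generated by the new document's keywords.
    start = frozenset(d for d, ks in context.items()
                      if context[new_doc] <= ks)
    seen = {new_doc}
    levels = []
    frontier = {start}
    while frontier:
        reached = set().union(*frontier) - seen
        if reached:
            levels.append(sorted(reached))
        seen |= reached
        parents = set()
        for x in frontier:
            above = [u for u in extents if x < u]
            # Keep only the direct superconcepts of x.
            parents |= {u for u in above
                        if not any(x < v < u for v in above)}
        frontier = parents
    return levels
```

With the four documents of Table 1 (as we read it) plus a hypothetical fifth document annotated with artificial intelligence, natural language processing and speech recognition, the traversal reports document 4 first, then document 3, then documents 1 and 2.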




Formal Concept Analysis for Domain-Specific Document Retrieval Systems 






At the second stage, we elicit the relevant keywords which are associated with the 
newly added document d. A weight for each relevant keyword is calculated using 
Definition 5, and the relevant keywords are presented to the user in order of their 
weight. The system then asks the user about the relevance of each extracted 
keyword. The user can also view the sub-lattice and the relevant documents for each 
of the relevant keywords during this process. The similarity relation between 
keywords and documents can be easily observed through the lattice. 

When the above stage is complete the document is located at a node. If one or more 
documents are already at the node, the user adding the new document is 
presented with the previous document(s) and asked to include keywords that 
distinguish the documents. The user can choose to leave the documents together with 
the same keywords. Ultimately however, every document is unique and offers 
different resources to other documents and probably should be annotated to indicate 
the differences. The approach used here is derived from Ripple-Down Rules, but the 
location of the document is determined by the lattice structure rather than the history 
of the development. 

Another knowledge acquisition issue we have addressed arises when a new term is 
entered for a new document; this term may also appropriately apply to other 
documents already in the system. This problem can be left until the system fails to 
provide an appropriate document for a later search, as in the RDR approach. 
However, in our approach, the system passes a log of the addition of a new document 
to a meta-expert. The expert then considers whether any document at the parent 
nodes of the new node should also have the term added. The following definitions are 
used in formulating the relevant documents and their associated new terms for the 
newly added case. 

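Definition 6: Let $\mathcal{L} = \langle V, \le \rangle$ be a lattice. Given a node 
$\theta \in V$, the set of direct parents of $\theta$, denoted $DP_{\mathcal{L}}(\theta)$, 
is defined as follows: $DP_{\mathcal{L}}(\theta) = \{\alpha \in V \mid \theta < \alpha$ 
and there does not exist any $\beta \in V$ such that $\theta < \beta$ and 
$\beta < \alpha\}$. 

Definition 7: Let $\mathcal{L}(C)$ be a concept lattice of the formal context 
C = (D, K, I) and d be the new document. For each document $\delta \in D$, we can 
define the set of relevant keywords for $\delta$ with respect to d, denoted 
$Rel_d(\delta)$, as follows: 

$Rel_d(\delta) = f(d) \setminus \bigcup \{\, Y \mid (X, Y) \in DP_{\mathcal{L}(C)}(\langle \{d\}, f(d) \rangle) \;\&\; \delta \in X \,\}$ 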

As the system evolves, new terms are added. As a consequence, it becomes 
necessary to handle synonyms or to group relevant terms together in order to extend 
users' queries. For this reason, we provide a tool with which experts of the system 
can build a thesaurus for the domain whenever it is required. We have developed a 
mechanism, connected to this process, to discover new concepts when a new case is 
added; it maintains the compatibility condition (is-a relationship) in the thesaurus 
hierarchy. Another mechanism is triggered when the system cannot find a node 
in the lattice for a query. In this case, the system sends a log file to an expert so s/he 
can decide if more appropriate keywords are required for documents. The expert then 
sends e-mail to the author (annotator of the document), attaching a hyperlink which 
makes it easy to refine the keywords of the document if desired. All interactions 
between the system and users are also logged. We are analysing the log file to find 
factors or user behaviours that influence the performance of the system. The 
user can immediately view the changed concept lattice and further decide whether the 
set of keywords they have assigned for the document is appropriate. 







5 Related Work 

Formal concept analysis has found a wide range of applications in medicine, 
psychology, libraries, software re-engineering and ecology, and has been applied 
to a variety of methods for data analysis, information retrieval and knowledge 
discovery in databases. A number of researchers have proposed this lattice structure 
for information retrieval. Here we consider only work where FCA has been applied to 
documents [4, 5, 12, 19]. 

Godin et al. [12] studied the advantage of the lattice method against hierarchical 
classification and also evaluated retrieval performance. Hierarchical classification 
retrieval showed significantly lower recall compared to the lattice-based retrieval and 
Boolean querying. Between the lattice-based retrieval and Boolean retrieval no 
significant performance difference was found, but they strongly argued the 
advantages of a lattice structure for browsing. Carpineto and Romano [4] determined 
that the performance of lattice retrieval was comparable to or better than Boolean 
retrieval on two medium-sized databases. Carpineto and Romano [5] also used a 
thesaurus as background knowledge in formulating a browsing structure and 
presented experimental evidence of a substantial improvement after the introduction 
of the thesaurus. The systems of Godin et al. [12] and Carpineto and Romano [4] were 
both implemented on a stand-alone microcomputer. More recently FCA has been used 
for document retrieval, culminating in a faceted information retrieval system (FaIR) [19]. 
This classifies the documents using a faceted knowledge representation based on a 
thesaurus or knowledge base, but a browsing scheme is not yet deployed. 

The focus in previous work was to examine the advantages and capabilities of the 
lattice-based retrieval. The main difference in the work here is an emphasis on 
incremental development and evolution, and knowledge acquisition tools to support 
this for specialised domains. Our aim is a browsing scheme which can be 
collaboratively created and maintained and where users evolve their own organisation 
of documents but are assisted in this to facilitate improvement of the system’s 
performance as it evolves. 

Another difference is that our focus is on a web-based system using a hypertext 
representation of the links to a node, but without a graphical display of the overall 
lattice. Lin [16] discussed how visualisation through a graphical interface could 
enhance information retrieval and generally the browsing mechanism in the 
application of FCA is based on exploring the lattice graph itself. However, we 
anticipate that most web users are unfamiliar or uncomfortable with concept lattice 
diagrams and viewing of the whole lattice diagrams will also remain a problem. Even 
though we agree the lattice diagram (graphical relationships of the concepts) can be a 
useful tool to review and explore the whole map of a domain, we believe that the 
hyperlink technique is a fairly natural simplification for a lattice display and is also 
very natural for Web users. 

A further difference is that we also support a textword search which is invoked 
automatically to identify the relevant documents from the context of the documents 
when the system fails to get a result from the lattice nodes. Conceptual scaling is also 
supported to handle multi-valued attributes which are obvious in the domain and to 
allow related values of single-valued contexts to be grouped together as the system 
evolves. 



6 Conclusion 

The system described above has been implemented and demonstrated with two test 
domains. There is little doubt that it seems to facilitate browsing and that users 
adding documents enjoy seeing how their document fits into the lattice and are 
motivated to make sure it is appropriately positioned. FCA is widely used for 
knowledge acquisition to discover concepts and rules related to objects and their 
attributes. Its advantage comes from the way it shows how the presence or absence of 
attributes distinguishes objects in the various super-concept sub-concept relations. In 
the system here a key feature is the incremental development of the knowledge base 
via a web-based interface. The key extension to FCA that we have implemented is 
similar to the philosophy of both RDR and Repertory grids [8] where an expert is 
asked to gradually build up axes of differentiation (constructs) between objects. Our 
system suggests keywords that the document may have in common with other 
documents and also indicates documents that have the same annotation and which the 
author may wish to differentiate. These techniques are very simple but their utility is 
well established in these other areas of incremental knowledge acquisition. 

A more substantial evaluation is being undertaken. In this evaluation, the FCA 
browsing scheme is being used as the search mechanism for research interests in a 
research institution. The researchers themselves add the keywords related to their 
home page, the target document. We anticipate that more complex ontologies may be 
useful in applications such as this. We are investigating techniques whereby the 
ontology can be constructed as the system develops, as suggested in this paper, but 
also techniques where ontology can be imported and used for making suggestions to 
the user, but where the user retains complete control in using existing or adding new 
terms. 

From our experience so far with this development it is clear to us that Formal 
Concept Analysis is a useful way of supporting the flexible open management of 
documents required by individuals, small communities or in specialised domains. It 
also appears that our approach can apply to conceptual modelling of domain 
taxonomies to be collaboratively created and maintained over time by multiple users 
(or authors) without the mediation of knowledge engineers. However, these apparent 
strengths and further possibilities still require a more thorough evaluation. 



Acknowledgments. The authors would like to thank Bao Vo and Dr. Rex B.H. Kwok 
for helping in formalising of mathematical formulas used in definitions. This research 
is supported by an Australian Research Council (ARC) grant. 



References 

1. Aussenac-Gilles, N., Biebow, B. and Szulman, S. Revisiting Ontology Design: A 
Methodology Based on Corpus Analysis, 12th European Conference on Knowledge 
Acquisition and Knowledge Management (EKAW 2000), Springer, 172-188, 2000. 







2. Benjamins, V. R., Fensel, D., Decker, S. and Perez, A. G. (KA)²: building ontologies for 
the Internet: a mid-term report. International Journal of Human-Computer Studies, Vol. 51, 
No. 3, 687-712, 1999. 

3. Carpineto, C. and Romano, G. GALOIS: An Order-Theoretic Approach to Conceptual 
Clustering, In Proceedings of the Machine Learning Conference, 33-40, 1993. 

4. Carpineto, C. and Romano, G. Information retrieval through hybrid navigation of lattice 
representations. International Journal of Human-Computer Studies, 45, 553-578, 1996. 

5. Carpineto, C. and Romano, G. A Lattice Conceptual Clustering System and Its 
Application to Browsing Retrieval. Machine Learning, 24(2), 95-122, 1996. 

6. Compton, P. and Jansen, R. A Philosophical Basis for Knowledge Acquisition. 
Knowledge Acquisition 2:242-257, 1990. 

7. Furnas, G. W., Landauer, T. K., Gomez, L. M. and Dumais, S. T. Statistical semantics: 
analysis of the potential performance of key-word information systems. Bell System 
Technical Journal, 62, 1753-1806, 1983. 

8. Gaines, B. and Shaw, M. Cognitive and Logical Foundation of Knowledge Acquisition. 
The 5th Knowledge Acquisition for Knowledge Based Systems Workshop, 9.1-9.25, 1990. 

9. Ganter, B. Computing with Conceptual Structures, Proceedings of the 8th International 
Conference on Conceptual Structures (ICCS 2000), Darmstadt, Springer, 453-467, 2000. 

10. Ganter, B. and Wille, R. Conceptual Scaling, In: F. Roberts (ed.): Application of 
Combinatorics and Graph Theory to the Biological and Social Sciences, Springer, 139- 
167, 1989. 

11. Ganter, B. and Wille, R. Formal Concept Analysis: mathematical foundations. Springer, 
Heidelberg, 1999. 

12. Godin, R., Missaoui, R. and Alaoui, H. Learning algorithms using a Galois lattice 
structure. Proceedings of the Third International Conference on Tools for Artificial 
Intelligence, San Jose, CA: IEEE Computer Society Press, 22-29, 1991. 

13. Godin, R., Missaoui, R. and Alaoui, H. Incremental concept formulation algorithms based 
on Galois (concept) lattices. Computational Intelligence, 11(2), 246-267, 1995. 

14. Kang, B. H., Yoshida, K., Motoda, H. and Compton, P. Help Desk System with 
Intelligent Interface, Applied Artificial Intelligence, 11: 611-631, 1997. 

15. Kim, M., Compton, P. and Kang, B. H. Incremental Development of a Web Based Help 
Desk System, Proceedings of the 4th Australian Knowledge Acquisition Workshop 
(AKAW99), University of NSW, Sydney, 13-29, 1999. 

16. Lin, X. Map Displays for Information Retrieval, Journal of the American Society of 
Information Science, 48:40-54, 1997. 

17. Maedche, A. and Staab, S. Mining Ontologies from Text, 12th European Conference on 
Knowledge Acquisition and Knowledge Management (EKAW), Springer, 189-202, 2000. 

18. Marchionini, G. and Shneiderman, B. Finding facts vs. browsing knowledge in hypertext 
systems, IEEE Computer, 21, 70-80, 1988. 

19. Priss, U. Faceted Information Representation, In: Stumme, Gerd (ed.), Working with 
Conceptual Structures. Proceedings of the 8th International Conference on Conceptual 
Structures, Shaker Verlag, Aachen, 84-94, 2000. 

20. Richards, D. and Compton, P. Knowledge acquisition first, modelling later. Knowledge 
Acquisition, Modeling and Management, E. Plaza and R. Benjamins, Berlin, Springer: 
237-252, 1997. 

21. Stumme, G. Hierarchies of Conceptual Scales. 12th Banff Knowledge Acquisition, 
Modelling and Management, Eds. B Gaines; R Kremer; M Musen, Banff Canada, 16-21 
Oct., SRDG Publication, University of Calgary, 1999. 

22. Wille, R. Restructuring lattice theory: an approach based on hierarchies of concepts. In: 
Ivan Rival (ed.). Ordered sets, Reidel, Dordrecht-Boston, 445-470, 1982. 

23. Wille, R. Concept lattices and conceptual knowledge systems. Computers and 
Mathematics with Applications, 23, 493-515, 1992. 




Learner’s self-assessment: a case study of SVM 
for information retrieval 



Adam Kowalczyk and Bhavani Raskutti 
{Adam.Kowalczyk, Bhavani.Raskutti}@team.telstra.com 

Telstra Corporation, 770 Blackburn Road, Clayton, Victoria 3168, Australia 



Abstract. The paper demonstrates that the predictive capabilities of a 
typical kernel machine on the training set can be a reliable indicator of 
its performance on the independent test set in the region where scores 
are larger than 1 in magnitude. We present initial results of a number 
of experiments on the popular Reuters newswire benchmark and the 
NIST handwritten digit recognition data set. In particular, we demon- 
strate that the values of recall and precision estimated from the training 
and independent test sets are within a few percent of each other for the 
evaluated benchmarks. Interestingly, this holds for both separable and 
non-separable data cases, and for training sample sizes an order of 
magnitude smaller than the dimensionality of the feature space used (e.g. 
using ≈ 2000 samples versus ≈ 20000 features for Reuters data). 

A theoretical explanation of the observed phenomena is also presented. 

1 Introduction 

Many of us can recall school days when, studying for a biology or chemistry test, 
we were quite conscious of which parts of the task at hand we had learned well 
and which we had not. And the satisfactory (or not) solutions of previously 
unseen problems in the school test the next day confirmed that our self-assessment 
was quite correct. Whether a learning machine is capable of similar introspection 
and can assess what it has learnt without referring to an external (independent) 
test is a fundamental question bordering on the issue of self awareness. What 
we have in mind here is not a test based on a hold-out validation set, or a 
cross-validation assessment, as is done in the case of decision tree generation, but 
an assessment based on uniform treatment of all of the training set and some 
simple information on the state of the machine. The message from this paper 
is that in some situations of practical interest such self-assessment can be done 
efficiently for support vector machines, even for small training samples. 

This is somewhat contrary to the accepted machine learning idea that 
estimates based on training sets are notoriously optimistic when compared with 
true values. For instance, in supervised learning of a classification the 
experimental training errors are much smaller than the test (true) errors. Similarly, 
in the theory of learning systems, a typical upper bound on generalisation error 
consists of a training error plus a significant penalty term. This penalty becomes 
non-trivial (i.e. < 1) only in the "thermodynamic limit" of unrealistically large 
training samples. In particular, the penalty term in the well known uniform 
bounds produced by VC-theory becomes < 1 only for a training sample many 
times the size of the VC-dimension of the learning machine, which is normally much 
larger than the number of available training samples for practical machines. 

Against this background kernel machines seem to be a notable exception. It 
has been noticed recently [10] that in the case of support vector machines [13] 
or regularisation networks [4,8], training margin error rate can be proven to be 
an almost unbiased estimator of the true margin error rate. Interestingly, proofs 
hold for small training samples, explicitly smaller than the VC-dimension of the 
function class. This can be extended to estimators of various other risks which 
can be used to define measures of interest for information retrieval, e.g. recall 
and precision. 

Can such estimators be of practical relevance? The answer is not straightforward.
Firstly, practical machines are only suboptimal approximations of ideal
solutions, and the imperfections may adversely affect the properties of interest.
Secondly, the proof of the above result relies essentially on assumptions, such
as iid sampling from a continuous probability density, which are not satisfied
in practice [7]. Thirdly, the estimators, although unbiased, may have a variance too
large to be of practical relevance. The prime aim of this paper is to test them
experimentally in some domains of practical interest, such as text categorisation
(Section 3) and recognition of handwritten digits (Section 4), and to offer some
theoretical corroboration of the observed phenomena (Section 5).



2 Support Vector Machines and Estimators 



Consider an m-sample

$$\vec{xy}^m := ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times \{\pm 1\})^m \qquad (1)$$

of patterns $x_i \in X \subset \mathbb{R}^n$ and target values $y_i = \pm 1$. The learning algorithms
used by support vector machines (SVM) [1,2,14] minimise the regularised risk
functional:

$$f_{\vec{xy}^m} = \arg\min_{f \in H} R_{\vec{xy}^m}[f], \qquad R_{\vec{xy}^m}[f] := \|f\|_H^2 + C \sum_{i=1}^m L([1 - y_i f(x_i)]_+). \qquad (2)$$

Here $H$ denotes a reproducing kernel Hilbert space (RKHS) [14] of real valued
functions $f : X \to \mathbb{R}$, $\|\cdot\|_H$ the corresponding norm, $C > 0$ is a regularisation
constant, $L : \mathbb{R} \to \mathbb{R}_+$ is a non-negative, convex cost function penalising the
deviation $1 - y_i f(x_i)$ of the estimator $f(x_i)$ from the target $y_i$, and $[\xi]_+ := \max(0, \xi)$.
For $L(\xi) := \xi^p$ with $p = 1$ (linear cost) or $p = 2$ (quadratic cost), the minimisation
(2) can be solved by quadratic programming [1] with the formal use of the
following expansions holding for the minimiser (2):

$$f_{\vec{xy}^m}(x) = \sum_{i=1}^m \alpha_i y_i k(x_i, x), \qquad (3)$$

$$\|f_{\vec{xy}^m}\|_H^2 = \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j k(x_i, x_j), \qquad (4)$$

where $k : X \times X \to \mathbb{R}$ is the kernel corresponding to the RKHS $H$ [8]. Likewise,
quadratic programming gives a solution to the hard margin case [1, 2, 14] which,
in terms of (2), corresponds to the cost $L(\xi) := 1$ for $\xi > 0$ and $L(\xi) := 0$
otherwise, and the constant $C > 1/\rho$, where

$$\rho := \max_{f \in H,\, f \neq 0} \min_i \frac{y_i f(x_i)}{\|f\|_H}$$

is the margin with which the data can be separated by the kernel machines.

3 Experiments with Reuters News-wires 

For our experiments, we have used a widely used text categorisation benchmark,
the modApte split of the Reuters-21578 news-wires collection available
from http://www.research.att.com/lewis [3, 12, 16]. This split has 9,603 training
documents (ApteTrain) and 3,299 test documents (ApteTest) spread over 135 diverse
categories with varying frequency of occurrence. The modApte split assigns
documents from April 7, 1987 and before to the training set, and the remainder
to the ApteTest set. This introduces a systematic bias in the sense used in
statistics. (For instance, contrary to the common sense expectation, for some
categories the training set is harder to classify than the test set, cf. Figures 1
and 2.) Hence, in our experiments we have systematically used two independent test
sets: one is the remainder of ApteTrain after the training subset was
randomly selected from it, and the other is the ApteTest collection.

The feature vector in these experiments consists of the 20,197 unique words
extracted from the ApteTrain documents, where the extraction involved case
conversion, stemming and removal of words in a standard stop list [12]. We
have used exclusively SVMs with linear kernel, $k(x, x') = 1 + x \cdot x'$, since this
is the simplest kernel machine and previous experiments have shown that more
complicated kernels do not necessarily give better performance [3]. Hence our
optimal machine has an expansion $f_{\vec{xy}^m}(x) = \sum_{i=1}^m \alpha_i y_i (1 + x_i \cdot x)$, where the
Lagrangian coefficients $(\alpha_i)$ are given as a solution of the following optimisation:

$$\min_{(\alpha_i)} \Big( \sum_{i,j=1}^m y_i y_j \alpha_i \alpha_j (1 + x_i \cdot x_j) + C \sum_{i=1}^m \Big[ 1 - y_i \sum_{j=1}^m \alpha_j y_j (1 + x_i \cdot x_j) \Big]_+^p \Big), \qquad (5)$$

where $p = 1$ or $2$.
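As an illustration of the objective in (5), the following minimal sketch evaluates it for a given coefficient vector using the linear kernel 1 + x · x'. It only evaluates the objective and does not solve the optimisation; the function names and the toy data are hypothetical.

```python
def linear_kernel(x, z):
    """Linear kernel k(x, x') = 1 + x . x' used in the Reuters experiments."""
    return 1.0 + sum(a * b for a, b in zip(x, z))


def regularised_risk(alpha, xs, ys, C, p=1):
    """Value of the objective in (5) for given coefficients alpha."""
    m = len(xs)
    K = [[linear_kernel(xs[i], xs[j]) for j in range(m)] for i in range(m)]
    # Squared RKHS norm, cf. expansion (4).
    norm_sq = sum(alpha[i] * alpha[j] * ys[i] * ys[j] * K[i][j]
                  for i in range(m) for j in range(m))
    penalty = 0.0
    for i in range(m):
        # f(x_i) from expansion (3).
        f_xi = sum(alpha[j] * ys[j] * K[j][i] for j in range(m))
        penalty += max(0.0, 1.0 - ys[i] * f_xi) ** p
    return norm_sq + C * penalty


# Toy 1-D sample (hypothetical): one pattern per class.
xs = [(1.0,), (-1.0,)]
ys = [1, -1]
print(regularised_risk([0.0, 0.0], xs, ys, C=3.0))   # all margin violations
print(regularised_risk([0.5, 0.5], xs, ys, C=3.0))   # both margins equal to 1
```

In this toy case the second coefficient vector places both patterns exactly on the margin, so only the norm term contributes.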

For a predictor $f : X \to \mathbb{R}$ and a data sequence $\vec{xy}^m \in (X \times \{\pm 1\})^m$ we shall
be evaluating estimators of recall

$$\mathrm{Rec}_{\vec{xy}^m}[f, \theta] := \frac{|\{i \,;\, f(x_i) > \theta \ \&\ y_i = 1\}|}{|\{i \,;\, y_i = 1\}|} \qquad (6)$$

[Figure 1, panels: (A) Training Size = 20%, C' = 800, p = 1, Target Size = 30%; (B) Training Size = 50%, C' = 800, p = 2, Target Size = 30%; (C) Training Size = 20%, C' = 800, p = 1, Target Size = 30%; (D) Training Size = 50%, C' = 800, p = 2, Target Size = 30%]

Fig. 1. Precision and Recall estimates and the corresponding differences for the Reuters
benchmark, category = ‘earn’ (30%), the largest category. Experimental settings were
C' = 800, p = 1, m = 1920 for Figures A and C and p = 2, m = 4801 for Figures B
and D. The systematic bias in ApteTest is clearly evident, e.g. contrary to common
sense expectations, we observe better recall for ApteTest than for the training set.



and precision:

$$\mathrm{Prec}_{\vec{xy}^m}[f, \theta] := \frac{|\{i \,;\, f(x_i) > \theta \ \&\ y_i = 1\}|}{|\{i \,;\, f(x_i) > \theta\}|}. \qquad (7)$$



First, we have generated 100 random splits of the ApteTrain collection into the
training (20% or 50% of the data) and test sets (the remaining 80% or 50%,
respectively). Estimators were calculated for the training, the test and,
additionally, the ApteTest sets for thresholds in the range $-2 < \theta < +2$. This
has been repeated for each experimental setting: $p = 1, 2$, learning constants
$C = C'/m$, where $C' = 10, 800$ and $m$ is the number of training examples,
$m = 1920, 4801$ for the 20% and 50% split respectively. In our figures we show
the averages and standard deviations over those 100 splits.
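The estimators (6) and (7) are simple counting ratios over the scores f(x_i). A minimal sketch of how they could be evaluated over a grid of thresholds is given below; the function names, the grid and the toy scores are hypothetical, and the averaging over random splits is omitted.

```python
def recall(scores, labels, theta):
    """Estimator (6): fraction of target patterns scored above theta."""
    positives = sum(1 for y in labels if y == 1)
    hits = sum(1 for s, y in zip(scores, labels) if s > theta and y == 1)
    return hits / positives if positives else float("nan")


def precision(scores, labels, theta):
    """Estimator (7): fraction of retrieved patterns that are targets."""
    retrieved = sum(1 for s in scores if s > theta)
    hits = sum(1 for s, y in zip(scores, labels) if s > theta and y == 1)
    return hits / retrieved if retrieved else float("nan")


def sweep(scores, labels, lo=-2.0, hi=2.0, steps=9):
    """Evaluate both estimators on an evenly spaced grid of thresholds."""
    thetas = [lo + k * (hi - lo) / (steps - 1) for k in range(steps)]
    return [(t, recall(scores, labels, t), precision(scores, labels, t))
            for t in thetas]


# Hypothetical scores f(x_i) and labels y_i for four patterns.
scores = [1.5, 0.5, -0.5, -1.5]
labels = [1, 1, -1, -1]
for theta, rec, prec in sweep(scores, labels):
    print(f"theta = {theta:+.1f}  recall = {rec:.2f}  precision = {prec:.2f}")
```

Precision is undefined (reported here as NaN) when no pattern is retrieved at a given threshold.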

Figure 1 shows precision and recall estimates (Figures 1(A) and 1(B)) and
their differences (Figures 1(C) and 1(D)) for the Reuters benchmark, category =
‘earn’ (30%), which is the largest category. The systematic bias in ApteTest
for this category is clearly visible, e.g. contrary to expectations, we observe
better recall for ApteTest than for the training set. In order to examine whether such
bias exists for other categories, Figure 2 plots the precision and recall estimates
for the second and third largest categories, category = ‘acq’ and ‘money-fx’.
Experimental settings were C = 800, m = 4801 and p = 2. Note that the
systematic bias in ApteTest is clearly visible for category = ‘acq’, e.g. we observe
better precision for ApteTest than for the training set. No such bias is evident
for category = ‘money-fx’, although the empirical estimates for this category are
very optimistic, indicating that this is one of the hardest of the top 10 Reuters
categories [3].

Fig. 2. Precision and Recall estimates for ‘acq’ and ‘money-fx’, the second and third
largest categories in the Reuters benchmark. Experimental settings were C' = 800, m =
4801 and p = 2. The systematic bias in ApteTest is clearly visible for category = ‘acq’,
e.g. we observe better precision for ApteTest than for the training set.

For learning constant C = 800 we observe a well defined phase transition at
$\theta = 1$ for recall and at $\theta = -1$ for precision. For the learning constant C = 10
those phase transitions are smoother; however, the gap between estimates for the
training and test sets is much smaller (compare Figure 1(D) with Figure 3(C)).

Figure 3 plots the precision and recall estimates for two different SVM settings,
and the corresponding differences between the training and test estimates
when different amounts of training data are used. From the differences graphs for
training size = 20% and 50% in Figures 3(C) and 3(D), it is evident
that the greater the number of training examples, the closer the training estimate
is to the test estimate. Figures 3(A) and 3(C) are for the settings C = 10,
p = 2 and category = ‘earn’, while Figures 3(B) and 3(D) are for the settings
C = 800, p = 1 and category = ‘crude’. Note that the differences in estimates
when the minority class size is 4% are much larger (Figures 3(B) and 3(D)), again
highlighting the need for estimates to be based on a large number of positive
training examples.

[Figure 3, panels: (A) Training Size = 20%, C = 10, p = 2, Target Size = 30%; (B) Training Size = 50%, C = 800, p = 1, Target Size = 4%; (C) C = 10, p = 2, Target Size = 30%; (D) C = 800, p = 1, Target Size = 4%]

Fig. 3. Precision and Recall estimates and the corresponding differences for the largest
category ‘earn’ (30%) and the fifth largest category ‘crude’ (4%) in the Reuters benchmark
for two different SVMs.



4 Recognition of NIST digits 

In this section we present results of a test of a support vector machine with a
non-linear kernel on a popular benchmark data set of handwritten digits^. Each
data entry is a 784 pixel grey scale image, with a pixel represented by an integer
between 0 and 255. We have decided to use the fourth order polynomial kernel
$k(x_i, x_j) := (1 + x_i \cdot x_j)^4$, which was the best performing for this data in [1]. In this
case the data is separable, hence we have decided to use a hard threshold SVM. Such a
machine can be obtained by optimisation of (2) with loss $L(\xi) := 1$ if $\xi > 0$ and
$L(\xi) := 0$ otherwise, and sufficiently large C, but there is also a possibility of using
dedicated algorithms instead. In our research an algorithm described in [9] has
been used.
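For concreteness, a minimal sketch of the fourth order polynomial kernel and of the resulting decision function in the form of expansion (3) follows. The helper names and the toy vectors are hypothetical; any rescaling of the pixel values is left out, and the training algorithm of [9] is not reproduced here.

```python
def poly_kernel(x, z, degree=4):
    """Polynomial kernel k(x, x') = (1 + x . x')^degree."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** degree


def decision(x, support_vectors, alphas, ys, kernel=poly_kernel):
    """Score f(x) = sum_i alpha_i * y_i * k(x_i, x), cf. expansion (3)."""
    return sum(a * y * kernel(sv, x)
               for sv, a, y in zip(support_vectors, alphas, ys))


# Toy 2-D example with hypothetical support vectors and coefficients.
svs = [(1.0, 0.0), (0.0, 1.0)]
alphas = [0.5, 0.25]
ys = [1, -1]
print(decision((1.0, 0.0), svs, alphas, ys))
```

A pattern would be retrieved at threshold θ whenever its score exceeds θ.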

In Figure 4 we give a sample of results obtained. Two target tasks have been 
set: one to retrieve images of digit 0 and another to retrieve images of digit 4. 
The training was on 30K samples randomly selected from 60K in the training 



^ Available from http://www.research.att.com/~yann/exdb/mnist/ 






Learner’s Self-Assessment: A Case Study of SVM for Information Retrieval 






[Figure 4, panels: Recall and precision for NIST digits, target class "0"; Recall and precision for NIST digits, target class "4"]

Fig. 4. Plots of average precision and recall estimators for the NIST digit data test.
Results of a single run are shown for the task of discrimination of digit 0 from the
remaining nine digits. The SVM was trained on 30,000 randomly selected samples from
the standard training set of 60,000 and then tested on the standard test set of 10,000
patterns. Note that for the digit 4, recall for threshold $\theta > 1$ is systematically lower for
the training set than for the independent test set.



corpus, and the test was on the standard set of 10K samples from different writers.
In the case of 0 the trained network used 1038 support vectors, made 41 errors
on the test set and achieved > 80% of the optimal margin. Similarly, in the case
of 4 the trained network used 1396 support vectors, made 71 errors on the test
set and achieved > 80% of the optimal margin.



5 Theoretical Explanation 



It is convenient to introduce the notation $Z := X \times \{\pm 1\}$ and
$\vec{xy}^m := (z_1, \ldots, z_m) := ((x_1, y_1), \ldots, (x_m, y_m))$ for the training m-sequence
from $Z^m = (X \times \{\pm 1\})^m$. We
shall always assume that our target or minority samples have label $y_i = +1$ and
reserve the label $-1$ for the majority (background) samples.

We assume that there is given a probability distribution P on the input space
$Z = X \times \{\pm 1\}$. The true or expected recall of a predictor $f : X \to \mathbb{R}$ is defined
by the conditional probability:

$$\mathrm{Rec}[f, \theta] := P[f(x) > \theta \mid y = 1]$$

for every $\theta \in \mathbb{R}$. We shall also study a modification of the empirical estimator (6):

$$\mathrm{Rec}_{\vec{xy}^m, m_1}[f, \theta] := \frac{|\{i \,;\, f(x_i) > \theta \ \&\ y_i = 1\}|}{m_1} \qquad (8)$$



where $m_1$ is an integer $0 < m_1 < m$. The most interesting is the case when
$m_1 \approx m\pi_1$, where $\pi_1 := P[y = 1]$ denotes the prior of the target class. For large
$m$, this modification of recall makes little difference.

In the theorem below we use the notation $Z^m_{m_1}$ for the subset of all sequences
$((x_i, y_i)) \in Z^m$ having exactly $m_1$ labels $y_i$ equal to 1. We recall that
$f_{\vec{xy}^m}$ denotes the SVM obtained as the minimiser of the regularised risk (2). The
theorem covers both the “soft margin” case, where data is either separable or
not, and the “hard margin” case, which is applicable to separable data only [1, 2,
13, 14]. In the former case the loss function $L(\xi)$ in (2) is assumed to be convex,
continuous and such that $L(\xi) > 0$ for all $\xi > 0$ and $L(\xi) = 0$ for all $\xi \le 0$. As
mentioned above, if data is separable with a margin $\rho > 0$, then the hard margin
case corresponds to solving (2) with $L(\xi) := 1$ for $\xi > 0$ and $:= 0$ otherwise, and
the constant $C > 1/\rho$.



Theorem 1. For every $\theta > 1$ and every integer $0 < m_1 < m$:



G Z 



m — 1 
\nii-l 



i9) 



We shall outline the proof in the subsequent subsection. Now we concentrate 
on the discussion of the above result. 

If TOi PS [tottiJ 1, then ~ 1 and 

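$$E\big[\mathrm{Rec}[f_{\vec{xy}^{m-1}}, \theta] \mid \vec{xy}^{m-1} \in Z^{m-1}_{m_1 - 1}\big] \approx E\big[\mathrm{Rec}[f_{\vec{xy}^m}, \theta] \mid \vec{xy}^m \in Z^m_{m_1}\big].$$

Hence with “high” accuracy (9) implies

$$E\big[\mathrm{Rec}_{\vec{xy}^m, m_1}[f_{\vec{xy}^m}, \theta] \mid \vec{xy}^m \in Z^m_{m_1}\big] \approx E\big[\mathrm{Rec}[f_{\vec{xy}^m}, \theta] \mid \vec{xy}^m \in Z^m_{m_1}\big]$$

and subsequently

$$E\big[\mathrm{Rec}_{\vec{xy}^m}[f_{\vec{xy}^m}, \theta]\big] \approx E\big[\mathrm{Rec}[f_{\vec{xy}^m}, \theta]\big]$$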

since $P\big(\bigcup_{|m_1 - m\pi_1| < m\epsilon} Z^m_{m_1}\big) \approx 1$ for an $\epsilon > 0$ and sufficiently large $m$. This can
be interpreted as a theoretical corroboration of the experimental observation that
there is no systematic bias in the empirical estimator of the recall for $\theta > 1$.
In other words, empirical recall is sometimes a pessimistic, sometimes an optimistic
estimator of the true recall, but on average neutral. Needless to say, our
experimental results in Figures 1, 2, and 3 are consistent with this statement.



5.1 Outline of the proof of Theorem 1 

The proof is based on the leave-one-out estimator and involves a number of steps.

A. The SVM $f_{\vec{xy}^m}$ obtained as the minimiser of the functional (2) is unique.
This can be derived from the strict convexity of the functional $f \mapsto R_{\vec{xy}^m}[f]$ [14].

B. Let $\vec{xy}^m \setminus i$ denote the training sequence (1) with the $i$th training
instance removed. Then

$$y_i f_{\vec{xy}^m \setminus i}(x_i) \le y_i f_{\vec{xy}^m}(x_i) \qquad (\forall\, \vec{xy}^m \in Z^m).$$

The crux is to show that if this inequality does not hold, then
$R_{\vec{xy}^m}[f_{\vec{xy}^m \setminus i}] \le R_{\vec{xy}^m}[f_{\vec{xy}^m}]$, which contradicts the uniqueness of the minimiser
$f_{\vec{xy}^m}$ (since $f_{\vec{xy}^m \setminus i} \ne f_{\vec{xy}^m}$ in such a case).

C. If $y_i f_{\vec{xy}^m}(x_i) > 1$, then

$$f_{\vec{xy}^m \setminus i} = f_{\vec{xy}^m}.$$

This can be derived from the convexity of the functional $\xi \mapsto L(\xi)$ in the
region $\xi > 0$, the uniqueness of the SVM solution and the Kuhn-Tucker conditions
[1,2,14].

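D. Let us consider the leave-one-out estimator of the number of recalled
patterns:

$$L_\theta(\vec{xy}^m) := \sum_{i=1}^m I[f_{\vec{xy}^m \setminus i}(x_i) > \theta \ \&\ y_i = 1],$$

for every $\theta \in \mathbb{R}$, where $I[\,\cdot\,]$ denotes the indicator function equal to 1 if its
argument is true and 0 otherwise. Then from the last two steps we get

$$L_\theta(\vec{xy}^m) = m_1\, \mathrm{Rec}_{\vec{xy}^m, m_1}[f_{\vec{xy}^m}, \theta] \qquad (10)$$

for every $\vec{xy}^m \in Z^m_{m_1}$ and $\theta > 1$.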
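The leave-one-out count in Step D can be sketched in a few lines: for each target pattern, retrain without it and check whether the retrained machine still scores it above the threshold. In the sketch below `train` is any function returning a predictor; the stand-in learner `train_midpoint` is hypothetical and is not an SVM, it is used only to keep the example self-contained.

```python
def loo_recalled(train, xs, ys, theta):
    """Leave-one-out estimate of the number of recalled target patterns."""
    count = 0
    for i in range(len(xs)):
        if ys[i] != 1:
            continue  # only target patterns can be recalled
        # Retrain on the sample with the i-th instance removed.
        f_without_i = train(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        if f_without_i(xs[i]) > theta:
            count += 1
    return count


def train_midpoint(xs, ys):
    """Hypothetical stand-in learner (NOT an SVM): scores a 1-D input by its
    offset from the midpoint between the two class means. Assumes that both
    classes are present in the sample."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y != 1]
    midpoint = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: x - midpoint


xs = [3.0, 2.0, -2.0, -3.0]
ys = [1, 1, -1, -1]
print(loo_recalled(train_midpoint, xs, ys, theta=1.0))
```

For an SVM, Steps B and C imply that for θ > 1 no retraining is actually needed, which is what makes the estimator cheap in that region.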

E. We show a variant of Luntz-Brailovski theorem [14]: 

I 4 ™ e zpiJ = ^!^ElReclf^..-,.e] 1 5^"- e Zf„"-i,](ll) 

for every $\theta > 1$, where $\pi_1 = P[y = 1]$ is the prior of class 1.

The proof involves a chain of transformations: 



E [ 



Leixtn 



mi 

1 



e ZZ,] 



miP(ZZ^) 

1 

nil — 1 

miP(ZZ^) 



(mi - 1 ) p(ziz;Z) 

miP(ZZ^) 

(nil - 1 ) 



> 0 ^ yi = l]dP(zi)...dP(Zm) 

« pm 

' / (/ >0 kyi = l]dP(zi)) 

dP(zi)...dP(zi-i)dP(zi+i)...dP(zm) 

J ■ ■ ■ J Rec[f^m-i , 0]dP(zi)...dP(zi-i)dP(zi+i)...dP(zm) 



E 



'i?ec[/^™-i,0] 1 xt^-^ G 



niTTi 



E[Rec[f^^-^,0] I xt^-^ G 



For the last equality we use the relation $P(Z^m_{m_1}) = \binom{m}{m_1} \pi_1^{m_1} (1 - \pi_1)^{m - m_1}$.

The equation (9) of Theorem 1 follows from (10) and (11).

6 General Discussion

Below we analyse the various issues highlighted by our experimental findings. 
Early stopping. The (preliminary) experiments reported in this paper involved 
generation of thousands of different SVMs, which forced us to use early stopping 
heuristics. This could affect our results though we do not believe that it is sig- 
nificant. Our view is that with more exact solutions, the observed phenomena, 
such as the phase transition, should become sharper. 

Link to support vectors. We recall that a data point $(x_i, y_i)$ is called a
support vector if the coefficient $\alpha_i$ in the data expansion (3) of the SVM is non-zero.
From the Kuhn-Tucker conditions it follows that for the (ideal) minimiser
of (2) this is equivalent to $y_i f(x_i) \le 1$. Thus we can summarise our empirical
results in the following “rule of thumb”: the empirical estimates of recall are
“accurate” as long as they are based on non-support vectors; however, once the
support vectors are involved, the estimates are optimistically biased.
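The rule of thumb can be made operational by splitting the training patterns according to their margin. The sketch below is only an illustration under the idealised condition stated above; the function name is hypothetical, and for numerically approximate solutions a small tolerance around 1 would probably be needed.

```python
def split_by_margin(scores, labels):
    """Split training indices into likely support vectors (margin <= 1)
    and non-support vectors (margin > 1), where margin = y_i * f(x_i)."""
    support, non_support = [], []
    for i, (s, y) in enumerate(zip(scores, labels)):
        (support if y * s <= 1.0 else non_support).append(i)
    return support, non_support


# Hypothetical training scores f(x_i) and labels y_i.
scores = [1.5, 1.0, 0.2, -1.2, -0.7]
labels = [1, 1, 1, -1, -1]
support, non_support = split_by_margin(scores, labels)
print("support vectors:", support)
print("non-support vectors:", non_support)
```

Recall estimates at thresholds above 1 depend only on patterns in the second group.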

Phase transition. The “sudden jump” (a “first order phase transition” in
physics parlance) in the training estimates of the recall for threshold $\theta \approx 1$ is
directly linked to the concentration of support vectors with positive label around
the value $f(x_i) = 1$. Note that for small values of the training constant C those jumps
become smoother, as support vectors are less concentrated (cf. Figure 3).

Correction for support vectors. The training estimates of the recall for
$\theta < 1$ can potentially be improved using various theoretical corrections and
lower bounds for the leave-one-out estimator [5, 6, 11, 15]. Joachims in [6] has
investigated recall for the Reuters benchmark at threshold $\theta = 0$, hence he overlooked
the phenomena studied in this paper.

Theoretical corroboration for precision. The extension of our theory
to the case of precision is harder, since its definition involves a denominator
dependent on the threshold $\theta$ (cf. Eqn. 7). However, rough approximations can be
derived for this case from our result for recall.

Linear vs. quadratic penalty. In our experiments, we have used two soft
margin SVMs ($p = 1, 2$) and additionally, a hard margin SVM for the NIST digit
recognition task, and our observations regarding estimators are valid across all
of these SVMs. Due to space limitations, we have not addressed the issue of
performance differences due to different SVMs.

Break-even point. Interestingly, the break-even point, i.e. the point at which
recall is equal to precision, is roughly the same for the test set for all settings
(cf. category = ‘earn’, in Figures 1(A), 1(B) and 3(A)). This raises the question of the utility
of the extensively used break-even point as a text categorisation performance
evaluation measure.
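On a finite threshold grid, recall and precision rarely coincide exactly, so the break-even point has to be approximated. A minimal sketch, assuming that the grid point with the smallest gap between the two estimates is taken, is shown below; the function name, this selection rule and the toy data are assumptions, not the procedure used in the experiments.

```python
def break_even(scores, labels, thetas):
    """Return (theta, value) at the grid threshold where |recall - precision|
    is smallest; value is the mean of the two estimates at that threshold.
    Returns None if precision or recall is undefined at every threshold."""
    positives = sum(1 for y in labels if y == 1)
    best = None
    for t in thetas:
        retrieved = [y for s, y in zip(scores, labels) if s > t]
        if not retrieved or positives == 0:
            continue  # precision or recall undefined at this threshold
        hits = sum(1 for y in retrieved if y == 1)
        rec, prec = hits / positives, hits / len(retrieved)
        gap = abs(rec - prec)
        if best is None or gap < best[0]:
            best = (gap, t, (rec + prec) / 2)
    return None if best is None else (best[1], best[2])


# Hypothetical scores and labels for five patterns.
scores = [2.0, 1.0, 0.5, -0.5, -1.0]
labels = [1, 1, -1, 1, -1]
print(break_even(scores, labels, [-1.5, -0.75, 0.0, 0.75, 1.5]))
```

In this toy example recall and precision coincide at θ = 0, where both equal 2/3.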

Bias in the standard test sets. Results in Figure 4 for digit 4 and in
Figures 1 and 2 for Reuters show that standard test sets for both data collections
used are biased in a way that empirical estimates of both retrieval and recall for
$\theta > 1$ are pessimistic. In the context of Theorem 1 the explanation is that the
standard training and test sets for these benchmarks are not iid sampled from
the same distribution, which is consistent with the way they have been created.

NIST digits. A more systematic study of the NIST digits benchmark in the
context of estimators considered in the paper will be reported soon, including
both soft and hard margin SVMs. The result reported in [1] is that very accurate
soft margin SVMs trained on the whole training set of 60K patterns use 1
to 2 thousand support vectors. Hence, the bulk of the training data (more
than 95%) are non-support vectors, for which training estimates should be very
accurate. However, one should remember that this may be obscured by the bias
which exists in the standard NIST digit test set (cf. Figure 4).

Practical implications. Due to the prevalent belief that estimates based on
training sets are notoriously optimistic, a common practice in machine learning
is to use tests on a validation set rather than on the training set in order to assess
the quality of the classifier. In practice this is fine as long as the training data is
abundant; however, this is not always the case. Moreover, if we try to subtly
fine-tune our classifier involving multiple tests on our validation set, then we tacitly
introduce additional bias and our validation set will not be truly independent
any more. The point we want to make here is that in practice, the training set is a
valuable and scarce resource, which should be utilised for assessment of classifier
performance whenever possible.



7 Conclusion 

The paper has introduced a novel topic of reliable performance estimation from
the training set. The preliminary experimental results presented here confirm a
theoretical prediction that the support vector machine performance on training data
is a reliable indicator of its performance on independent test data, in the region
where the allocated score is larger than 1 in magnitude. This has been demonstrated
for practical benchmarks of Reuters news-wires and NIST hand-written digits.

We have shown that empirical estimates of recall and precision from the train- 
ing set can be of high accuracy, with errors below a few percent. For Reuters 
benchmark these results have been demonstrated for relatively small training 
sample sizes, an order of magnitude smaller than the dimensionality of the fea- 
ture space. 

We have demonstrated that the standard test sets for the popular Reuters news-wires
and NIST hand-written digit benchmarks have systematic bias, making,
for instance, the performance on the test set better than on the training set.
Such an anomaly can obscure some subtle properties of learning machines, and
researchers should exercise care while dealing with these data sets.

A number of new questions have been raised by this research and should be
investigated further. In particular, future research should investigate other data
sets including other real data sets and carefully designed artificial data. Also
kernels other than the linear kernel studied in this paper should be tested. In
addition, the well pronounced phenomenon of phase transition in accuracy of
estimation that has been observed consistently should be investigated further.



Abstract. This study attempts to find a theoretical basis for the development of
digital cities. The ultimate function of a digital city is to support navigation in
an environment. Navigation builds on meanings of the environment resulting
from semiosis processes. These processes may affect each other when combined
so that they become able to communicate. Communication is performed with signs
and depends on the behavioral co-ordination of communicating parties. The
classical theories do not satisfactorily explain communication. The paper
introduces a new model of communication appropriate for computer treatment.



1 Introduction 

In the last few years, a great deal has been written in the academic and popular litera- 
ture about the extension of the urban space-economies and social institutions into the 
new “virtual areas” called “digital cities” [11]. A digital city is usually understood as a 
collection of digital products and information resources deployed for a collaborative 
use. The principal mission of a digital city is to provide services aimed at facilitating 
social and/or spatial navigation in a virtual (e.g. “information”) or physical (e.g. geo- 
graphical) space. Typically, a digital city comprises a large distributed database of 
heterogeneous documents of various digital genres - texts, maps, animated images, 
and the like. It uses a computer network and a client-server protocol and allows for 
browsing across digital documents through appropriately ordered hyperlinks to search, 
retrieve, and manipulate information as needed. Networking and information retrieval 
are often pointed to as key issues for the development of digital cities. 

As part of the information delivery, a digital city usually assists in interpretation of 
the results of a user’s query. To facilitate understanding the results (or even the query 
itself, as in exploratory search), a digital city may provide the user with the related 
context or employ an illustrative metaphor or suitable analogy. Besides, it may utilize 
the user’s feedback (or some data about the user) for adjusting retrieving or displaying 
the obtained information to make it more accessible and meaningful to the user. An- 
other important issue that is thus often discussed by digital city developers is human- 
computer interaction. 

Reflecting the present understanding of the concept of digital city (which is,
however, far from being unified), reference [10] reports a number of implemented or
projected digital cities. Different authors approach the task of the development in a
similar manner, defining a digital city through its functions or else through its contents
with vague terms, such as “useful information,” “cyberspace,” “community,” and the
like, and with ad hoc design decisions. These decisions may be of arbitrary relevance
to the users’ needs, and they may have unpredictable (especially, on a long-term scale)
social, technical, and economical consequences. Owing to the speculative definitions
of basic concepts, with which they are defined, reported digital cities readily lose
their identities and become almost indistinguishable from other digital products, such
as map repositories and digital libraries. The study of digital cities obviously lacks
conceptual clarity, and hence, the developed digital cities are not necessarily useful
and usable. Another drawback common to the current implementations is that
although it is usually admitted by default that a digital city is set up for a group of users
rather than for a single user, the reported projects were focused on and addressed
specific aspects related to the individual needs (e.g. planning a sightseeing tour) and
personal adaptation (e.g. of the interface). The issue of the appropriateness of a digital
city to a particular society has not been explored. Even more obscure remains the
question of possible mutual influences of a digital city and the society of its users. All
this could be a serious reason to question the very expediency of the digital cities.

Our work first seeks to develop a theoretical basis for the creation of digital cities. 
Through the study, we examine a digital city as an organization of interacting social 
agents and propose a semiotic model explicating communication of the agents. It is 
argued that the semiotic approach not only allows for building a powerful theory of 
communication, but suggests important implications for the digital city development. 

The rest of the paper is structured as follows. Section 2 investigates the concept and 
identifies communication as a definitive function of a digital city. Section 3 discusses 
different models of communication. The core of the paper then follows, introducing a 
semiotic model of communication in Section 4. The model helps us better understand 
the dynamics of communication. Some ideas on how to apply the theoretical findings 
are reported in Section 5. Finally, Section 6 outlines related work and summarizes the 
study. 



2 Navigation with a Digital City 

The general task of navigation in an environment* can be described as a four-stage
iterative process that includes [20]: 1) perception of the environment, 2) interpretation
of the perception, 3) deciding whether the current goal has been reached, and 4)
appropriately adjusting the behavior. Among the four stages, the last two have obviously
a subjective character, whereas the other two depend on “objectively” available -
sensed - information about the environment. Perception first receives and represents
raw sensory data and provides for the further interpretation by combining (i.e. putting
into a context) the obtained representations. When information available through the
senses is not enough for establishing or re-establishing meanings (i.e. “knowledge”) of
the environment necessary for the decision-making, the navigator may ask a guide for
help - someone who (presumably) knows more about the environment. A digital city
may be seen as such a guide: it works to enhance the navigator’s sensing capabilities.

* For the purpose of this study, we will not distinguish the environment of a digital city as
surroundings from the environment as navigation space. In both cases the environment is
“that, which is not the digital city,” and the latter is often part of the former.

Perceptual Control Theory [17] proposes an explanation of the control mechanism 
for the navigation process. The theory tells us that a perceiving entity seeks to bring 
the perceived situation to its goal (or preferred state) by utilizing negative feedback 
from the environment: if the situation deviates from the goal, the entity acts and 
adapts, possibly changing its own state and the state of the environment, and the new 
situation is again sensed and estimated with respect to the goal. The loop repeats and
keeps the system in a stable goal-directed (or motivated) state. A digital city can, in 
principle, sense its environment directly (e.g. through cameras and transducers - in the 
case of spatial navigation). There is, however, no other way for it to determine the 
context and, hence, semantics necessary for making the sensed information meaning- 
ful, but (ultimately) by drawing on the expertise of its users and utilizing feedback 
from them. In this aspect, the users (together with their knowledge) are constitutive 
parts of the digital city that should then be considered a social system. 
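The negative feedback loop described above can be illustrated with a very small numerical sketch: the entity perceives the situation, compares it with the goal, and acts in proportion to the deviation. The scalar state, the identity "sensor" and the proportional gain are assumptions made purely for illustration; they are not part of Perceptual Control Theory as cited.

```python
def control_loop(goal, state, gain=0.5, steps=20):
    """Drive a scalar state towards a goal using negative feedback."""
    history = [state]
    for _ in range(steps):
        perceived = state              # perception (identity sensor, assumed)
        error = goal - perceived       # deviation of the situation from the goal
        state = state + gain * error   # act so as to reduce the deviation
        history.append(state)
    return history


trajectory = control_loop(goal=10.0, state=0.0)
print(f"start = {trajectory[0]:.3f}, end = {trajectory[-1]:.3f}")
```

With a gain between 0 and 1 the deviation shrinks at every iteration, so the loop settles near the goal.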

Each user’s knowledge is a subjective reconstruction of the locally and selectively 
perceived environment. No user possesses the perfect knowledge, but being connected 
by means of the digital city, the users can interact with each other, thus accessing
the collective “knowledge” - once sensed or created information about the environ- 
ment - that is usually far more complete and encompassing than the knowledge of a 
solitary user. There can be different social interactions between users of a digital city, 
but most typical interactions are the following [7]: communication of the goal (or 
motivation), communication of the relevant knowledge, and the location of a source 
(e.g. another user) of relevant knowledge. Given the diversity and apparent subjectiv- 
ism of each user’s knowledge, to understand how a digital city should operate, one 
must clarify (at least) three principal issues: 1) what is communication (and how it 
goes on), 2) what is(are) the role(s) of a digital city in communication, and 3) how 
communication reconciles the diversity of the subjective views of reality. 



3 Modeling Communication 

There are two major approaches to understanding and modeling communication
processes [6]: statistical “signal-oriented” and interpretive “meaning-based.” The
Shannon-Weaver theory with its conveyor tube model [19] represents the former class of
the approaches. The theory (and the model) assumes the following (see also Figure 1):

• There are the (information) source and the target parties involved in
communication that is seen as the “exchange of information” between them through
a channel. It is generally possible for a third party - an observer - to judge
about the correctness of information, whether sent or received.

• The source is active and initiates communication.

• The channel is passive and unstructured: “useful information” can be
extracted from the transmitted signal, provided it statistically differs from
(physical or semantic) noise.

• At the target side, received information is utilized by embedding it into
predefined information structures.

Fig. 1. The conveyor tube model

Although criticized by many (see [24]), the Shannon-Weaver theory currently
dominates every other theory of communication in terms of its conceptual development
and significance for practice. Among the most noticeable shortcomings of the
conveyor tube model, we would mention its inability to explain the phenomena of
(mis)understanding, lying, and the psychological effects of verbalizing thoughts and
emotions. More significant (though evident) for us, however, is the fact that the
Shannon-Weaver theory can contribute little, if anything, to clarifying and coping with
the complexity of communication in a social context [5]. In the case of a digital city,
often neither the target nor the source can be uniquely identified (rather, there can be
many sources and targets, which may or may not coincide), and it is unclear what role
(apart from the straightforward “information channel” role) a digital city can play in
communication. This makes the statistical approaches ineffective for the study of a
digital city as a social system.

Striving to compensate for the limitations of the conveyor tube model, a number of 
interpretive models of communication have recently been developed (e.g. [1, 14]). 
Rooted in the human sciences, an interpretive model postulates that: 

• There are no physical target and source but interpretants - that which
follows semantically from interpretation processes.

• The observer cannot judge the correctness or incorrectness of information:
these two are subject to individual interpretation. Besides, there
is no direct access to reality, and the decisive notions, like “true” and
“false,” are only socially determinable.

• The target, rather than source, is active. 

• Not mere information, but meaning is produced, sent, and interchanged in 
the course of interaction between a carrier (e.g. text or sound) and culture. 

Operating with meaning, interpretive models are often defined in terms of semiotics
- a science of signs, which (in the Peircean interpretation [16]) departs from the
naive treatment of signs as mere signifiers of their objects by introducing a third aspect
of the representation process - the interpretant - that corresponds to the meaning
connecting a sign with its object. While a more detailed introduction to semiotics will be
given in the next section, declaring that a sign can have many different meanings
depending on the socio-cultural context should suffice to understand Figure 2, which
presents an interpretive model of communication proposed in [1].

Fig. 2. The interpretive model



The model sees communication as the interaction of two or more psychic systems
with a shared communicative - social (cultural) - system, in the course of which
signifieds (products of the psychic systems) become signs (products of the social system),
which have socially (culturally) determinable meaning. In this view, a digital city is to
play the role of the social system, and the interpretive approach can (and does)
explicate many of the communication phenomena overlooked in the conveyor tube
framework (see [1] for details). At the same time, however, the model appears too
speculative to be useful in practice: it says little about mutual influences of the social and
psychic systems (and, hence, about the dynamics of these systems), and it leaves one
confused by specifying the functioning of a psychic system in terms of “signifieds”
understood as either “objects” with which meaning is expressed (e.g. sound-waves) or
“objects” of interpretation (i.e. that which is expressed). This, as well as the poorly
matched formalization of interpretive models, suggests to us that to meet the modeling
needs of digital city development, a new approach, which would assimilate the
advantages but remove the shortcomings of the different communication theories,
needs to be devised.

4 Towards a Semiotic Theory of Communication 

4.1 Insight from Systems Theory 

From a behavioristic viewpoint, an individual engaged in navigation develops an in- 
ternal representation using those distinctions - “signs” - of the environment, which 
turn up solutions to the problem that are successful behaviors [3]. Signs of such a 
representation arrive as “tools for indication purposes” [18]. When met in an environ- 
ment, these signs (i.e. the distinctions they stand for) serve to orient the navigator, 
regardless of their other possible (or “actual”) meanings and roles. The navigator is 
not really interested in “getting to the truth,” but in knowing what happens or what are 
possible consequences - expectations, when a sign is encountered. In this aspect, signs 
come up as signifiers of once successful interaction between an individual and an 
environment: a sign is an orientational “pointer” to not merely an object standing in a 
referential relation with it, but to the outcome desired for the user (e.g. “turn left after 
the sign-post” but not “follow the sign-post”). Signs can be considered “anticipations 
of successful interactions of referral” [18], emphasizing their origin and expected 
influence on behavior. 

One can show that the behavioristic view of the grounding process of forming signs
is just a specialization of the classical view that defines information as “a difference
that makes a difference” to the interpreter [2]. The specialized view, however, makes
it difficult to explain communication in a digital city as mere exchange, whether of
information, signs, or meaning. Indeed, in the case of navigation with a digital city,
not objective reality but subjective experience underlies the creation of signs. The
navigator often cannot succeed in developing an interpretation of a sign received
through communication by simply referring the sign to the observed environment -
the navigator’s personal experience first has to be coordinated (up to a point)
with the experience of the “creator” of the sign. The latter requires something more
than just sending and receiving information (signs, meaning, etc.).

An advanced explanation of communication that includes aspects of information
(sign) exchange as well as behavioral coordination between autopoietic systems can
help us shed more light on the phenomenon. An autopoietic system is a dynamic system
maintaining its organization on account of its own operation: each state of such a
system depends on its current structure and a previous state only [13]. The structure of
an autopoietic system determines the system’s possible (i.e. self non-destructive)
behaviors that are triggered by its interactions with the environment. If the system
changes its state, causing changes of the structure, without breaking autopoiesis, the
system is structurally coupled with the environment. If the environment is structurally
dynamic (e.g. is itself an autopoietic or self-organizing system), then the system
and the environment may mutually trigger their structural changes, sustaining the
system’s self-adaptation. When there is more than one autopoietic system in the
environment, the adaptation processes of some of the systems present may become
coupled, acting recursively through their own states. All the possible changes of states
of such systems, which do not destroy their coupling, create a consensual domain for
the systems. Behaviors in a consensual domain are mutually oriented. Communication,
in this view, is the behavioral coordination resulting from the interactions that occur in
a consensual domain (see [5] for details). This definition can be used to refine and
improve the interpretive model described in the previous section by introducing the
dynamic aspects. To do so, let us first clarify the terminology.

4.2 Terminology and Basic Assumptions 

In Peirce’s formulation [16], semiotics studies the process of interaction of three sub- 
jects: the sign itself - the signifier, its object - that which is signified by the sign, and 
the interpretant - the meaning made of the sign. No sign is directly connected to an 
object: signs acquire meanings only when they are re-represented in (referred to) a 
system of interpretance that is a sign system, which creates a context (e.g. by estab- 
lishing relations on signs). Naturally, the same sign may have different meanings 
while signifying different objects, or the same sign may have different meanings while 
signifying the same object, or different signs may have the same meaning while signi- 
fying the same object, and so on. Designated semiosis processes determine the mean- 
ing(s) of a sign in all the specific situations. 

A semiosis process is the process of establishing the meaning of some distinctions 
in an environment that entails representation and re-representation of these distinctions
over levels of interpretation (that form different systems of interpretance), where 
every level is governed by and adopts certain developmental rules and axioms called 
norms. The norms reflect different aspects of human behavior and can be classified 
into five major groups [21]: perceptual (to respond to peculiarities of sensing), cogni- 
tive (to deal with cultural knowledge and beliefs), evaluative (to explain personal 
preferences, values, and goals), behavioral (to delineate behavioral patterns), and 
denotative (to specify the choice of signs for signifying). Semiosis comes as a natural 
organizational process: it organizes signs in a partial level-hierarchy by ordering them 
so that signs of objects (which can also be signs) of level N-1 for processes and
structures of level N+1 are placed on level N. The lowest-level signs, e.g. (manifestations
of) physical objects, behaviors, emotions, and the like, are perceived or realized 
through their distinctions and may get a representation at an “intermediary” level of 
norms, reflecting interpretive laws of a higher, experiential and environmentally 
(physiologically, socially, technically, economically, etc.) induced level, which deter- 
mines “meanings” for the lower-level signs. This simplified three-level structure cor- 
responds to a single semiosis process, whereas navigation in an environment engen- 
ders multiple semiosis processes and results in the creation of a multi-level sign sys- 
tem with a potentially infinite hierarchy of dynamic interpretive levels [12]. 

A user of a digital city deals with a fragment of the global, i.e. loosely shared 
through the environment by all the users, system of signs. The fragment is, however, 
distinctively ordered in an interpretive hierarchy peculiar to the user’s experience and
norms adopted. Hierarchies created by different users may be different in terms of the
order and the coverage, and they may run on different time-scales. Unlike the case of
individual navigation, where perceived and conceived signs need not be articulated
- “externalized” - explicitly, the operation of a digital city neatly builds on
communicative use of the global sign system representing the environment and the 
digital city itself. This sign system is a projection of a consensual domain of the com- 
municating parties onto the domain of physical objects and phenomena that comes as a 
language defined in a very general “behavioristic” way. The digital city “describes” its 
environment with such a language, which has a syntax reflecting the organization of 
the environment, semantics establishing meanings of the environment, and pragmatics 
characterizing the effect of the language use. The language is to reconcile the subjec- 
tivism and diversity of individual perceptions through communication. 

4.3 Semiotic Model 

Let y(t) be the state of an autopoietic system at time t, and x(t) be the vector of states 
of the system parts, which constitute its structure. Following the definition of an auto- 
poietic system [13], we can write: 

$$
\begin{cases}
x(t+1) = f(x(t),\, y(t)), \\
y(t+1) = g(y(t),\, x(t+1)),
\end{cases}
\qquad (1)
$$

where f and g are some functions, specifying the behavior of the system parts and the 
system as a whole, respectively. If f and g are properly specified, these equations 
allow one to characterize the dynamics of the system. 
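
As a minimal sketch of how the recurrence (1) can be iterated, the following Python fragment steps a system forward in time. The particular `f` and `g` used here are hypothetical placeholders chosen only to make the example runnable; the text does not prescribe them.

```python
from typing import Callable, List, Tuple

Structure = Tuple[float, ...]  # x(t): states of the system parts
State = float                  # y(t): state of the system as a whole


def simulate(
    f: Callable[[Structure, State], Structure],
    g: Callable[[State, Structure], State],
    x0: Structure,
    y0: State,
    steps: int,
) -> List[Tuple[Structure, State]]:
    """Iterate x(t+1) = f(x(t), y(t)) and y(t+1) = g(y(t), x(t+1))."""
    x, y = x0, y0
    trajectory = [(x, y)]
    for _ in range(steps):
        x = f(x, y)  # the parts react to the current structure and state
        y = g(y, x)  # the whole reacts to its previous state and the new structure
        trajectory.append((x, y))
    return trajectory


# Hypothetical dynamics (an assumption for illustration only).
def f(x: Structure, y: State) -> Structure:
    return tuple(0.5 * xi + 0.5 * y for xi in x)


def g(y: State, x: Structure) -> State:
    return 0.5 * y + 0.5 * sum(x) / len(x)


trajectory = simulate(f, g, x0=(0.0, 1.0), y0=0.0, steps=3)
```

Note that `g` receives the already updated structure, which mirrors the dependence of y(t+1) on x(t+1) in (1).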

By the system-theoretic explanation (see Section 4.1), communication is an interaction
between autopoietic systems taking place in a dynamic environment. The psychic
system of a human is an instance of an autopoietic system [13], a social system is
an autopoietic system [23], and a digital city, seen as a “digital realization” of a social 
system, may be considered an autopoietic system, too. The interpretive model of 
communication described in Section 3 can then be re-formulated. 

Let t be a discrete time-mark corresponding to a single semiosis process
$S = \{\mathrm{Object}(t), \mathrm{Sign}(t), \mathrm{Meaning}(t)\}$ in a partial time-sequence of communication
$S(t)$. Let us also assume that the abstract notion of psychic state is equivalent
to the totality of subjectively valid interactions (behaviors), and the notion of
social state is equivalent to the totality of socially valid (i.e. appropriate for
communication) signs. In line with (1), the dynamics of a psychic system involved in
communication can semiotically be characterized as follows:

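$$
\begin{cases}
\mathrm{Objects}(t+1) = \mathrm{Externalizing}(\mathrm{Objects}(t),\, \mathrm{PsychicState}(t)), \\
\mathrm{PsychicState}(t+1) = \mathrm{Interpreting}(\mathrm{PsychicState}(t),\, \mathrm{Signs}(t+1)),
\end{cases}
\qquad (2a)
$$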

where “Externalizing” and “Interpreting” are some parametric relational mappings 
that specify the uttering process and the (personal) understanding process, respec- 
tively. “Objects” is a state vector representing the behaviors, which are (expected to 
be) individually effective (by feedback), and “Signs” is a state vector representing the 
behaviors socially effective. 

Analogously, for the social system: 

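$$
\begin{cases}
\mathrm{Signs}(t+1) = \mathrm{Externalizing}(\mathrm{Signs}(t),\, \mathrm{SocialState}(t)), \\
\mathrm{SocialState}(t+1) = \mathrm{Interpreting}(\mathrm{SocialState}(t),\, \mathrm{Objects}(t+1)),
\end{cases}
\qquad (2b)
$$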

where “Externalizing” and “Interpreting” are some parametric relational mappings 
that specify the processes of social “filtering” and “adaptation,” respectively. 
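
To make the coupling between (2a) and (2b) concrete, here is a rough Python sketch that advances one psychic and one social system together. Representing “Objects”, “Signs”, and both states as sets of behavior labels, as well as the four toy mappings and the `ACCEPTABLE` set, are assumptions made purely for illustration; the model leaves the actual mappings unspecified.

```python
from typing import Callable, FrozenSet, Tuple

Behaviors = FrozenSet[str]
Mapping = Callable[[Behaviors, Behaviors], Behaviors]
Snapshot = Tuple[Behaviors, Behaviors, Behaviors, Behaviors]


def step(
    objects: Behaviors, signs: Behaviors, psychic: Behaviors, social: Behaviors,
    utter: Mapping, understand: Mapping, social_filter: Mapping, adapt: Mapping,
) -> Snapshot:
    """One time step of the coupled equations (2a) and (2b)."""
    next_objects = utter(objects, psychic)          # first line of (2a)
    next_signs = social_filter(signs, social)       # first line of (2b)
    next_psychic = understand(psychic, next_signs)  # second line of (2a)
    next_social = adapt(social, next_objects)       # second line of (2b)
    return next_objects, next_signs, next_psychic, next_social


# Hypothetical mappings (assumptions, not taken from the text).
ACCEPTABLE = frozenset({"wave", "bow"})  # behaviors the toy society can adopt


def utter(objects: Behaviors, psychic: Behaviors) -> Behaviors:
    return objects | psychic


def understand(psychic: Behaviors, signs: Behaviors) -> Behaviors:
    return psychic | signs


def social_filter(signs: Behaviors, social: Behaviors) -> Behaviors:
    return signs | social


def adapt(social: Behaviors, objects: Behaviors) -> Behaviors:
    return social | (objects & ACCEPTABLE)


state: Snapshot = (
    frozenset({"wave"}),           # Objects(t)
    frozenset({"bow"}),            # Signs(t)
    frozenset({"wave", "shout"}),  # PsychicState(t)
    frozenset({"bow"}),            # SocialState(t)
)
for _ in range(2):
    state = step(*state, utter, understand, social_filter, adapt)
objects, signs, psychic, social = state
```

In this toy run the individual's behavior "wave" passes the social filter and becomes a sign after two steps, while "shout" does not; the individual, in turn, picks up the sign "bow".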

Equations (2a) and (2b) permit us to fully characterize the communication process, 
provided the corresponding relational mappings have thoroughly been defined. It is 
important to note that neither “social” nor “personal” time is represented explicitly in 
the model, but by the effect they have on the semiosis process. The model states that 
(also see Figure 3 for a graphical illustration): 

• The correctness and incorrectness of information are subject to both, indi- 
vidual and social interpretations. 

• There are no “meanings” in the social system, and there are no objects in 
the psychic system. Semantics of the communication language is a result 
of a social convergence of understanding the environment, while its syntax 
is a social convergence of the interactions (behaviors). The semantics and 
syntax may overlap. 

• There is no target or source: communication is seen as a partial time- 
sequence of interdependent (recurrent) semiosis processes. 

• The social system plays the role of an active communication channel: it 
filters communications out of interactions (behaviors) and buffers percep- 
tion against “noise” - processes and phenomena not immediately related 
to the communication. 

Fig. 3. The semiotic model



• “Meanings” change owing to perturbing a psychic system with signs; signs 
change owing to perturbing the social system with objects (behaviors). 

A digital city creates a social system and can now precisely be defined as an
autopoietic organization of social agents communicating via the digital medium, such that
every social agent is a realization of a semiosis process caused by navigation taking
place in a common (for all agents) environment.



5 Example 

The main problem of the semiotic model introduced in the previous section is that it is
difficult to implement. Indeed, the relational parametric mappings “Externalizing” and
“Interpreting” of (2a) and (2b) are not fixed but depend on the internal “hidden”
dynamics of the social and psychic systems, which, although they can in principle be
estimated for a period of time by observation, are generally unpredictable, as are the
dynamics of any autopoietic or self-organizing system [4]. While the latter problem
can, to an extent, be addressed within approaches of systems theory, such as
synergetics, chaos modeling, and cellular automata (see [8] for a survey), in this section we
will discuss an interface design that is a partial realization of the semiotic model made
under certain simplifying conditions.

Let us first confine our consideration to a single individual interaction (and, thus, to 
one psychic system) in the sequence of semiosis processes representing a communica- 
tion. Substituting the second equation of (2a) into the first results in the following: 

$$
\mathrm{Object}(t+1) = \mathrm{Externalizing}\bigl(\mathrm{Objects}(t),\; \mathrm{Interpreting}(\mathrm{PsychicState}(t-1),\, \mathrm{Signs}(t))\bigr).
\qquad (3)
$$

Let us assume that during communication, both the psychic and the social systems
do not change their states, and that the socially valid interactions (i.e. “Signs”) do not
change. Let us then set the beginning (for the psychic system) of the communication at
time t (note that generally, $\mathrm{Objects}(t) \neq \emptyset$). As $\mathrm{PsychicState}(t-1)$ appears, in this
interpretation, nothing but relations and constraints imposed on (a subset of) the signs
(which are supposed not to change under the postulated “zero dynamics”), (3) can be
reduced to the following form:

$$
\mathrm{Object}(t+1) \xleftarrow{\;\mu(\mathrm{Objects}(t))\;} \mathrm{Signs}(t),
\qquad (4)
$$

where $\mu$ is some set of rules controlling the selection.

Fig. 4. Modeling communication with an intelligent interface

Figure 4 shows an interactive Web-page filtering process with an “intelligent”
interface, which realizes the model specified by (4). The user starts the interaction by
inputting a query to a search engine that then produces a (usually vast) set of digital
documents (e.g. texts or hyperlinks). The system then asks the user to evaluate the
relevance of arbitrarily chosen elements of the set so that some $\mu'$ can be built based
on the relevance feedback. $\mu'$ is applied to filter the retrieved documents, and the
procedure may be repeated until a satisfactory $\mu$ is found.
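
The filtering loop just described can be sketched roughly as follows. This is not the interface reported in [15]: the keyword-based rule set, the choice of the first `sample_size` documents as the sample, and the stopping test (the filter no longer changes the document set) are all simplifying assumptions introduced for illustration.

```python
from typing import Callable, List, Set, Tuple


def learn_rules(judged: List[Tuple[str, bool]]) -> Set[str]:
    """Build a rule set from feedback: words seen only in relevant documents."""
    relevant_words: Set[str] = set()
    irrelevant_words: Set[str] = set()
    for text, is_relevant in judged:
        target = relevant_words if is_relevant else irrelevant_words
        target.update(text.lower().split())
    return relevant_words - irrelevant_words


def apply_rules(rules: Set[str], documents: List[str]) -> List[str]:
    """Keep the documents that contain at least one rule word."""
    return [d for d in documents if rules & set(d.lower().split())]


def filter_interactively(
    documents: List[str],
    judge: Callable[[str], bool],  # stands in for the user's relevance feedback
    sample_size: int = 2,
    max_rounds: int = 5,
) -> Tuple[List[str], Set[str]]:
    rules: Set[str] = set()
    for _ in range(max_rounds):
        sample = documents[:sample_size]  # assumption: a deterministic sample
        rules = learn_rules([(d, judge(d)) for d in sample])
        filtered = apply_rules(rules, documents)
        if filtered == documents:         # the rules no longer narrow the set
            break
        documents = filtered
    return documents, rules


retrieved = [
    "kyoto temple guide",
    "kyoto bus timetable",
    "temple opening hours",
    "bus fares",
]
kept, rules = filter_interactively(retrieved, judge=lambda d: "temple" in d)
```

Here two rounds of feedback on two documents each are enough to reduce the four retrieved documents to the two the simulated user cares about.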

The interface was originally developed by one of this paper’s authors for the general
purposes of information retrieval. It demonstrated the ability to facilitate the process
of man-machine communication by reducing the number of interactions necessary to
obtain information of interest. A more detailed account of the implementation can be
found in [15].

Although the simplified treatment of the semiotic model appears reasonable and
natural in many cases (the probability of changing the social state is far less than the
probability of changing the psychic state [12], and the latter probability is quite small
when the goal - see Section 2 - does not change), a weak point of the implementation
is that the user is required to know (at least, to a degree) the communication language,
i.e. to know some of the relevant elements of the “Signs” to initiate communication.
This problem could be alleviated with a realization of the second part of the semiotic
model - the equations (2b) - which we plan to accomplish in the near future: an
advanced version of the interface is to be deployed in a digital city. (It is easy to see that
(2b) can be reduced to a form similar to (4), e.g. $\mathrm{Sign}(t+1) \leftarrow \mathrm{Objects}(t)$,
under certain conditions.)

6 Related Work and Conclusions

Our semiotic model of communication resembles the one of “dynamic semiotics”
proposed in [1], although our ultimate goal is more pragmatic: to develop an effective
digital city rather than to explain communication phenomena. Besides, the approach
that we advocate in this paper is, in our opinion, theoretically sounder, as it relies
on results independently obtained in several disciplines, such as cognitive and
social sciences [17, 7], semiotics [12], and complex system theory [9]. The work [23]
on Niklas Luhmann’s theory of social systems is also closely related; however, we
have a different vision of communication and apply a different apparatus to explicate
it. We do not concentrate on the generic social phenomena but study their “projection”
and effect on the digital media. One may see this work as an effort to somewhat
widen but specialize the idea of modeling the society as the “global superorganism”
[9] - we believe that our research has fewer points that are understood only intuitively
and, therefore, better suits developmental needs. Among information systems
related to our study, we should mention the so-called recommender systems (see a survey
in [22]), which usually utilize some specific method (e.g. data mining) or “common
sense” knowledge rather than a communication theory or model. All these works
have influenced our research, and [1] actually inspired us to apply the apparatus of
semiotics. Unfortunately, we have not found reports that approach the development of
digital cities theoretically.

Among the results of the study, we would first like to point to the clarification of 
communication as a socio-cognitive phenomenon, and to the determination of the role 
of a digital city as a socio-culturally controlled communication channel that is, on the 
other hand, a “realization” of the social system in the form of a language. Some impli- 
cations of these results for the design of digital cities are as follows. A digital city 
should be able to utilize relevance feedback from its users to co-ordinate communica- 
tion at both the personal (by adjusting to the social system) and the social (by adjust- 
ing to the user) levels. This will reduce the cost of communication by decreasing the 
number of necessary interactions. The semantics of the communication language (and, 
therefore, the structure of the user-system interactions) is determined by the users of 
the digital city rather than by some “objective” laws or “universal” ontologies. The 
digital city should be able to adjust the semantics as it evolves. Finally, more studies 
on the semiosis of communication are necessary, as well as the design of new tech- 
niques to implement the semiotic model. 

The presented work is part of the Universal Design of Digital City project in the 
Core Research for Evolutional Science and Technology programme funded by the 
Japan Science and Technology Corporation. 

Abstract. This paper introduces a new class of systematic experimental studies
targeted towards a better understanding of the strengths and weaknesses of
knowledge acquisition (KA) methods. We model a domain along with the be- 
haviour of a domain expert. Using these models we can simulate the KA process 
and observe which factors of the domain, the expert or the KA technique, affect 
the overall result of the KA process. On the basis of our models, we can also com- 
pare the performance of our KA techniques against the performance of automatic 
KA techniques, i.e. against machine learning techniques. 

We present a number of results from our modelling approach. These results in- 
clude the surprising fact that in some domains, building a decision tree by consult- 
ing an expert for providing a correct discriminating attribute along with a correct 
threshold value for a presented case, may still be inferior to an automatic method, 
such as C4.5, using the same set of cases. Furthermore, we obtained new insights 
into characteristics of the knowledge representation scheme being used (Ripple 
Down Rules) as well as guidelines for experts when providing knowledge. Finally,
we advocate that our methodological approach for studying KA techniques
also be used much more widely in machine learning research. We consider
our approach an important methodological complement to the extensive
performance comparisons in machine learning research using ‘natural datasets’.



1 Introduction 

The careful evaluation of the effectiveness of a new approach in AI is important but
often difficult to conduct in practice. Empirical studies have been conducted in a large
number of subfields of AI, including theorem proving, constraint satisfaction, vision,
machine learning and neural networks. Often UCI datasets are used to allow comparison
of approaches and evaluation studies. The UCI machine learning (ML) repository
contains a number of datasets obtained from real applications. Evaluations in
knowledge acquisition (KA) have been very limited with regard to the actual knowledge
acquisition process, and those few studies that have been done used real data, partly
from the UCI ML repository.

This paper demonstrates the use of simulated domains in investigating the strengths 
and weaknesses of knowledge acquisition techniques. A model of a domain is devel- 
oped that provides a source of cases, a target function which is the goal of the induction 
or knowledge acquisition process and also a simulated expert, which can be consulted 
in the knowledge acquisition process. The major insight that has arisen from the work
reported in this paper is that the simulated domain approach enables one to investigate
how the domain structure affects knowledge acquisition much more readily than using
‘real-world’ data where the target function is unknown. Webb [18] is one of the
few evaluation studies that have been done with human subjects as experts. Simulated
experts may suffer somewhat from inaccurate expert models but allow more compre- 
hensive and detailed studies. 

In this paper, the knowledge acquisition technique of Ripple Down Rules (RDRs) 
will be studied in regards to how quickly an accurate knowledge base can be developed, 
depending on the structure of the domain. We also compare RDR’s performance against 
the performance of an automatic technique, i.e. the decision tree learner C4.5 [13]. Sec- 
tion 2 provides a survey of previous evaluation studies in KA. It is assumed that the 
reader is familiar with C4.5 but may not be familiar with RDR. Hence, a brief overview 
will be presented in Section 3. The target concepts used in the studies presented in sec- 
tion 4 will be hyper-rectangles in an n-dimensional space and decision trees. Section 5 
contains the conclusions. 



2 Previous Evaluation Studies 

There has been a strong emphasis on evaluation in the KA community [15]. However,
the approaches used do not really evaluate the actual process of knowledge acquisition
from an expert [12]. The major focus of the KA community at the time was on problem
solving methods, so the major focus of evaluation was on the appropriateness
of the problem solver for the problem rather than on the process of
building the actual problem solver.

The origins of this work are investigations into a class of KA techniques known as 
Ripple-Down Rules (RDR) [3]. Since RDR is critically concerned with populating a 
knowledge base, a different approach to evaluation was developed early on [4]. In this
approach knowledge bases were first developed by machine learning using some UC 
Irvine data sets. These knowledge bases were then used as ‘experts’ to build further 
knowledge bases using one or more RDR techniques. Similar recent work is found 
in [17]. Comparisons of the knowledge bases built by knowledge acquisition (RDR) 
were made with machine learning techniques using training sets of various sizes. These 
approaches demonstrated that RDR produced surprisingly compact knowledge bases 
and, because of the use of an expert, the early performance of the RDR system was 
always better than that of machine learning alone. As with machine learning studies 
using these data sets, the techniques perform differently on different data sets and one 
is faced with trying to understand what features of the domains cause the variations 
in performance. The hypothesis of the present study is that by building various domain 
models or target functions, and using these both to produce data and to act as the ‘expert’ 
one will be able to better understand why various learning and knowledge acquisition 
techniques perform as they do. Indeed, this paper presents results that contradict some 
of the conclusions drawn from those studies using the ‘real data sets’. 

Recently a more general simulation study has been conducted where the key mea- 
sures have been the tendency of the expert to over-generalise or over-specialise in the 
rules they provide [2]. In the present work, however, a much ‘finer-grained’ simulation
approach is presented which allows much deeper insights into the domain structure and 
its impact on the KA process. 

There have been many studies using simulated data in machine learning evaluation.
However, in most cases these have not been used to explore the way in which the
domain structure affects the induction, and in general the machine learning community has
preferred real-world data. There are a few exceptions, such as the early
in-depth studies by Rendell and Cho [14] or the studies of boosting by statisticians [7],
but also some other ML studies, such as [5]. Perhaps most notable in this regard is the
field of reinforcement learning, in which artificial domains are the norm rather than the
exception. Langley [11] wrote a brief but very good motivation for experimental studies
using artificial data.

3 Ripple Down Rules 

3.1 Background 

The key features of the various Ripple Down Rules (RDR) techniques are: 

1. the expert monitors the system while in use. Whenever the system does not perform
to the satisfaction of the expert, e.g. when it misclassifies a case, the expert adds 
another rule to the so-called RDR tree. 

2. the use of exception structures so that errors in the system are corrected by adding 
refinement rules. Rules are not edited to correct errors. All corrections to the system 
are made by adding new rules. 

3. the incremental development while the knowledge base is in actual use. RDR sys- 
tems can be built off-line, but the technique is aimed at building systems by cor- 
recting errors while in use. 

RDR systems have been implemented for a range of application areas and tasks. 
The first industrial demonstration of this approach was the PEIRS system, which pro- 
vided clinical interpretations for reports of pathology testing [6]. By now there is a quite 
substantial history of success with this approach to knowledge acquisition, despite the 
considerable initial skepticism in the knowledge acquisition community about an incre- 
mental development where the only form of correction was the addition of new rules. 
The approach has also been adapted to tasks including multiple classification [8], con- 
trol [16], heuristic search [1], spoken-language dialogue systems [10], and document 
management [9]. The essential idea of RDR is exhibited in the Single Classification 
Ripple Down Rules. Hence, we used those for the studies presented in this paper. 

3.2 Single Classification Ripple Down Rules 

A single classification ripple down rule (SCRDR) tree is a finite binary tree with two
distinct types of edges. These edges are typically called exception (or true) and false
edges. They are used for evaluating cases and will be discussed later. Associated with
each node in a tree is a rule. A rule has the form: if α then β, where α is called the
condition and β the conclusion. In this study, binary classification is often used. In such
cases, β can be considered as having a value of positive or negative. The condition
will typically be either a single attribute-value test or a conjunction of such tests. See
Figure 1.

An example (or synonymously case) is a point in an n-dimensional space and a 
SCRDR tree is used to assign a conclusion to such points. A case in SCRDR is evaluated 
by passing it to the root of the tree. At any node in the tree, if the example entails the 
condition of the node, the node is said to fire. If a node fires, the example is passed to 
the next node following the except (or true) branch. Otherwise, the case is passed down 
the false branch. This determines a path through a SCRDR tree for an example. The 
conclusion given by this process is the conclusion from the last node which fired on 
this path. To ensure that a conclusion is always given, the root node typically contains 
a trivial condition which is always satisfied. This node is called the default node. 
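
The evaluation procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the `Node` class, the dictionary encoding of cases, and the attribute names `temp` and `age` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Case = dict  # assumed encoding: attribute name -> numeric value


@dataclass
class Node:
    condition: Callable[[Case], bool]      # the rule's condition (alpha)
    conclusion: str                        # the rule's conclusion (beta)
    except_child: Optional["Node"] = None  # followed when the node fires
    false_child: Optional["Node"] = None   # followed when it does not fire


def evaluate(root: Node, case: Case) -> Optional[str]:
    """Return the conclusion of the last node that fired on the path."""
    conclusion = None
    node = root
    while node is not None:
        if node.condition(case):
            conclusion = node.conclusion
            node = node.except_child
        else:
            node = node.false_child
    return conclusion


# Default node: a trivially true condition, so a conclusion is always given.
root = Node(lambda c: True, "negative")
root.except_child = Node(lambda c: c["temp"] > 38.0, "positive")
root.except_child.except_child = Node(lambda c: c["age"] < 2, "negative")
```

With this toy tree, a case with `temp` 39.0 and `age` 30 is classified as positive, because the second node is the last one to fire on its path.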

A SCRDR tree is revised when the evaluation process returns the wrong conclusion. 
When this happens a new node is placed at the end of the evaluation path which gave 
this wrong answer. The example causing this change (call this example e) is associated 
with the new node and is called the cornerstone case for the node. To determine the 
rule for the new node, the expert formulates a rule which is satisfied by e but not by the 
cornerstone case for the last node which fired in the evaluation path*.







Fig. 1. An example SCRDR tree. Node 1 is the default node. 



The strength of the RDR approach lies in the fact that it is usually rather easy for 
an expert to provide discriminating conditions between two presented cases, while it is 
difficult for an expert to provide valid and complete classification rules for the general 
case. 



* Note that if the last node to fire in the evaluation path is a leaf node, the new node is attached 
to the true branch of that node. Otherwise, the new node is attached to the false branch of the 
last node in the evaluation path. 
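
The revision step and the attachment rule of the footnote might be sketched as follows. The code reads the footnote as: attach to the except (true) branch if the last node on the evaluation path fired, and to its false branch otherwise. The `Node` class, the function name `add_correction`, and the example attributes are hypothetical, and the expert's rule is simply passed in as a ready-made node.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    condition: Callable[[dict], bool]
    conclusion: str
    cornerstone: Optional[dict] = None
    except_child: Optional["Node"] = None
    false_child: Optional["Node"] = None


def add_correction(root: Node, case: dict, new_node: Node) -> None:
    """Attach new_node at the end of the evaluation path of `case`."""
    node, last, fired = root, root, False
    while node is not None:
        last = node
        fired = node.condition(case)
        node = node.except_child if fired else node.false_child
    new_node.cornerstone = case          # the case that triggered the revision
    if fired:
        last.except_child = new_node     # last node on the path fired
    else:
        last.false_child = new_node      # last node on the path did not fire


root = Node(lambda c: True, "negative")  # default node
case1 = {"temp": 39.5}
add_correction(root, case1, Node(lambda c: c["temp"] > 38.0, "positive"))
case2 = {"temp": 36.0, "rash": 1}
add_correction(root, case2, Node(lambda c: c["rash"] == 1, "positive"))
```

Existing nodes are never edited; every correction only adds a node, which mirrors the second characteristic of RDR listed earlier.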













4 Simulation Studies 

The following studies consider an n-dimensional feature space ranging from 0.0 to 
100.0 in each dimension. Examples were randomly drawn according to a uniform dis- 
tribution in the feature space. Different types of models were again randomly created 
and used to classify the randomly generated examples. The results plotted in the follow- 
ing graphs are the averaged results over 20 randomly generated target concepts of the 
same type as explained below. Experiments indicated that further averaging over trials 
with different seeds for the generation of examples has only a negligible effect on the 
results. The simulated expert is assumed to derive answers to the KA requests from the 
generated models as explained in the following. 

In the following we study three types of target concepts: a) sets of hyper-rectangles,
which correspond to disjunctive concepts in numerical domains, a rather natural class
of concepts, at least for human conceptualisation; b) sets of nested hyper-rectangles, a
class of concepts where one would expect Ripple Down Rules to perform better than
decision tree learning; c) decision trees, i.e. concepts which use the same representational
scheme as our decision tree learner C4.5.

4.1 Sets of Hyper-Rectangles 

We considered the following class of domains: In the n-dimensional Euclidean space
we assume a number of rectangular areas (hyper-rectangles) to cover all positive
examples, while the negative examples lie in the remaining area. See Figure 2.

Each hyper-rectangle is randomly generated as follows: For some dimensions, the
hyper-rectangle will extend over the entire value range, while for other dimensions, a 
constraining interval will be imposed. Which dimensions are chosen for the constrain- 
ing intervals is decided randomly using a uniform distribution. To define a constraining 
interval two numbers are randomly generated between 0.0 and 100.0. 
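
A possible sketch of this generation procedure is given below. It assumes that the two random numbers are sorted to form the lower and upper bound of an interval and that the number of constrained dimensions is fixed in advance; the function names and the seed are illustrative only.

```python
import random

LOW, HIGH = 0.0, 100.0


def random_hyper_rectangle(n_dims, n_constrained, rng):
    """Map each constrained dimension to a (lower, upper) interval."""
    dims = rng.sample(range(n_dims), n_constrained)  # uniform choice of dimensions
    rect = {}
    for d in dims:
        a, b = rng.uniform(LOW, HIGH), rng.uniform(LOW, HIGH)
        rect[d] = (min(a, b), max(a, b))  # assumed: the two numbers are sorted
    return rect


def inside(rect, point):
    # Unconstrained dimensions extend over the whole value range.
    return all(lo <= point[d] <= hi for d, (lo, hi) in rect.items())


def classify(rects, point):
    """Class 1 inside any hyper-rectangle, class 0 elsewhere."""
    return 1 if any(inside(r, point) for r in rects) else 0


rng = random.Random(0)
rects = [random_hyper_rectangle(20, 4, rng) for _ in range(20)]
point = [rng.uniform(LOW, HIGH) for _ in range(20)]
label = classify(rects, point)
```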







Fig. 2. A set of four hyper-rectangles in the 2-dimensional space. The area inside any of the 
rectangles belongs to class 1, while the remaining area belongs to class 0. 



Expert model: The expert is always asked to provide one or multiple significant dif- 
ferences between two cases belonging to two different classes. That is, the expert is 
presented one case which is inside one of the hyper-rectangles and a second case which 
lies outside. It appears plausible that in such a case, the expert is able to provide at least
one of the edges of the cube containing the positive example, i.e. the expert provides
both the relevant attribute and the correct threshold value. This seems plausible, as the
expert will usually be able to articulate discriminatory conditions, but only in the con- 
text of two concrete cases. For example, in a medical domain, given two cases the expert 
may realize that body temperature is critical in assessing the case and the expert will 
usually be able to provide sensible threshold values of the critical fever level. Practical 
experiences with the RDR approach confirm the plausibility of our hypothesis. 

In the following experiments we varied a constraint on the number of conditions (to 
be used in conjunction), which the expert may provide for a single pair of cases that 
need to be discriminated. 
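
One way to simulate such an expert is sketched below. It assumes the hyper-rectangle containing the positive case is known to the simulated expert, represents a condition as a hypothetical `(attribute, operator, threshold)` triple, and returns at most `max_conditions` edges that the negative case violates. The order in which edges are considered is an arbitrary choice of this sketch.

```python
def discriminating_conditions(rect, negative, max_conditions):
    """Edges of `rect` (which contains the positive case) that exclude `negative`."""
    conditions = []
    for d, (lo, hi) in sorted(rect.items()):
        if len(conditions) == max_conditions:
            break
        if negative[d] < lo:
            conditions.append((d, ">=", lo))   # lower edge separates the cases
        elif negative[d] > hi:
            conditions.append((d, "<=", hi))   # upper edge separates the cases
    return conditions


def satisfies(case, conditions):
    """True if the case meets every condition (a conjunction)."""
    ops = {">=": lambda v, t: v >= t, "<=": lambda v, t: v <= t}
    return all(ops[op](case[d], t) for d, op, t in conditions)


rect = {0: (10.0, 20.0), 2: (50.0, 60.0)}
positive = [15.0, 0.0, 55.0]
negative = [5.0, 0.0, 70.0]
one = discriminating_conditions(rect, negative, 1)
two = discriminating_conditions(rect, negative, 2)
```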




Fig. 3. A 20 hyper-rectangles problem with 4 constraining attributes each. (Left) Number of ex- 
amples (x-axis) against error rate of RDR (y-axis). (Right) Number of examples (x-axis) against 
RDR tree size (y-axis). 



One of the more surprising results we obtained is shown in Figure 3, where C4.5
outperformed RDR when the expert was only allowed to provide a single attribute plus
threshold as condition. Only when the expert was allowed to provide two or more
discriminating conditions (used in conjunction) was the error of RDR less than the
error of C4.5. This effect was visible only for domains of relatively high complexity,
i.e. with 20 or more hyper-rectangles each having 4 constraining intervals randomly
chosen from the 20 attributes. However, it should be noted that the 20 hyper-rectangles
have only 160 edges altogether (20 hyper-rectangles × 4 intervals × 2 edges). The actual
tree sizes generated are much larger. This
is due to the tree structure, which in this case does not allow a compact representation
of the domain. With fewer hyper-rectangles, all versions of RDR outperformed C4.5, 
which we expected, as RDR appears to have superior resources, by being able to consult 
the expert. With more hyper-rectangles, the discrepancies between the three methods 
increased further. 

Figure 3 (right) shows the tree sizes. C4.5 produces the smallest trees, while RDR with
single-attribute conditions builds more than twice as many nodes from 100 000 examples.
Both results contradict conclusions from earlier studies in [4], which used ‘real world’
data sets.

4.2 Nested Hyper-Rectangles

The next class of domains we studied is described by a set of nested hyper-rectangles
which classify cases in alternating order. That is, the outermost hyper-rectangle C1
classifies positively, hyper-rectangle C2 inside C1 classifies negatively, hyper-rectangle
C3 inside C2 represents an exception to C2 and classifies positively again, etc.
We expected that RDR performs well on this type of domain, as the nested structure of 
hyper-rectangles corresponds to the exception structure, for which RDR seems so well- 
suited. At least if multiple attributes were allowed to be provided as conjunction by the 
expert, RDR should outperform C4.5. 
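
The alternating classification can be written compactly by counting how many of the nested hyper-rectangles contain a point. The sketch below assumes that the rectangles are given from outermost to innermost, that each is encoded as a mapping from constrained dimension to interval, and that points outside C1 are negative.

```python
def classify_nested(rects, point):
    """rects[0] is the outermost rectangle; each later one lies inside its
    predecessor. Returns 1 (positive) or 0 (negative)."""
    depth = 0
    for rect in rects:
        if all(lo <= point[d] <= hi for d, (lo, hi) in rect.items()):
            depth += 1
        else:
            break
    # Odd nesting depth is positive; even depth (including 0) is negative.
    return depth % 2


nested = [{0: (0.0, 90.0)}, {0: (10.0, 80.0)}, {0: (20.0, 70.0)}]
```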

Surprisingly, the simulation studies showed that in such domains RDR does not 
perform any better than C4.5. See Figure 4. What is even more surprising is the fact 
that the number of attributes to be provided per rule does not make any substantial 
difference. 

One would think that RDR can tap into much richer resources (asking the expert for 
a discriminating attribute plus the exact threshold value) and the provision of multiple 
attributes in a condition appears ideally suited to the domain. In contrast, C4.5 has to do 
that job merely on the basis of the available, and often somewhat insufficient, training 
data. It is also surprising that for this class of domains the tree sizes are essentially the 
same, regardless of whether developed by C4.5 or grown by RDR.

From the studies of the non-nested hyper-rectangles we can derive the guideline 
for the expert to provide as many discriminating conditions as possible when presented 
with two cases of different classes. 

4.3 Decision Trees 

Another way of defining a target concept is as a decision tree. In the study presented 
here, random decision trees are generated given parameters for the number of attributes 
present, the depth of the decision tree, and the number of attributes which may occur in 
a test at internal nodes in the decision tree. A random binary conclusion is then assigned 
to each leaf node. 

Expert model: In the RDR framework, the only task of the expert is to differentiate 
pairs of examples. Given two examples, an expert creates a rule which one example 
satisfies and which the other contradicts. One way of generating such a rule from a 
decision tree is to traverse the tree from the root and return the first attribute test which 
sends the two examples down different branches. 

This simulated expert who picks the first test in the model tree as described, called 
expert 1, has the property that the rules generated for two examples (depending on 
which case is misclassified) are negations of each other. This renders the order of pre- 
sentation of the two cases irrelevant for the chosen discriminating condition in contrast 
to the following approach. 

Another way of generating a rule is to place emphasis on the misclassified example. 
Such an expert would make rules which are more specific to the misclassified example. 
More formally, consider the evaluation path of a misclassified example. The attribute 
test which is closest to the leaf of this path and which distinguishes the cornerstone case 
could be used. We call the expert who follows this approach expert 2. 
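
The two simulated experts can be contrasted in code. The sketch below is an illustration under simplifying assumptions: every internal node of the target tree tests a single attribute against a threshold, and the class names `Leaf` and `Split` are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: int


@dataclass
class Split:
    attr: int
    threshold: float
    low: Union["Split", Leaf]    # taken when case[attr] <= threshold
    high: Union["Split", Leaf]   # taken when case[attr] > threshold


def goes_high(split, case):
    return case[split.attr] > split.threshold


def expert1(tree, a, b):
    """First test from the root that sends a and b down different branches."""
    node = tree
    while isinstance(node, Split):
        if goes_high(node, a) != goes_high(node, b):
            return node
        node = node.high if goes_high(node, a) else node.low
    return None  # both cases reach the same leaf


def expert2(tree, misclassified, cornerstone):
    """Test closest to the leaf on the misclassified case's path that
    distinguishes it from the cornerstone case."""
    chosen, node = None, tree
    while isinstance(node, Split):
        if goes_high(node, misclassified) != goes_high(node, cornerstone):
            chosen = node
        node = node.high if goes_high(node, misclassified) else node.low
    return chosen


tree = Split(0, 50.0, low=Split(1, 30.0, Leaf(0), Leaf(1)), high=Leaf(1))
a, b = [40.0, 20.0], [60.0, 40.0]
```

For the two cases `a` and `b`, expert 1 returns the root test on attribute 0 regardless of argument order, whereas expert 2 applied to the misclassified case `a` returns the deeper test on attribute 1.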
We are interested in the performance of the two experts with regard to the error rate
and RDR tree size depending on the complexity of the domain and the number of
presented examples. Figure 5 shows the result of running trials of increasing target concept
complexity. The number of training examples and the number of attributes in the state
space were kept constant for each trial. In this case, there were 20 attributes and 10 000
training examples. The results of varying the number of training examples can be seen
in Figure 6. For this graph, the complexity of the target concept was kept constant at a
depth of 10 with 20 attributes available. In both tests, the pruned error rate has not been
plotted because the difference is tiny. Similar shapes were found for the graphs for trees
of less complexity, although, as one would expect, the convergence speed was greater.

Fig. 4. Number of examples (x-axis) against RDR error rate (y-axis) on a 10 nested
hyper-rectangles problem with 4 constraining attributes for each hyper-rectangle. Similar
results were obtained for varied numbers of nested hyper-rectangles and varied numbers
of constraining attributes.

Fig. 5. Increasing error rate (y-axis) with increasing target concept complexity (depth of
the balanced decision tree on x-axis) based on a fixed number of 10 000 training examples.

The immediate and striking feature of these results is that expert 1 for RDR produces 
the lowest error rate. C4.5 sits in the middle with expert 2 performing the worst. This 
conclusion has to be weighed and balanced. One advantage gained by RDR is that it 
has greater resources. The simulated expert has direct access to the target concept and 
generates rules which are attribute tests in the target concept. This is tempered by the 
fact that RDR is an incremental learner. Depending on the order of examples, different 
rules will be returned by the simulated expert. Of particular importance are the rules 
returned by the first few training examples. These rules will occur close to the root 
of the RDR tree and have a greater impact on classification and on the structure (and 
size) of the emerging RDR tree. With expert 1, these rules will more than likely come
from near the top of the target decision tree. Starting from the root of the decision tree,
each attribute test has (on average) a 50% chance of distinguishing two examples†.
In contrast, expert 2 will generate rules which come from close to the bottom of the
decision tree. These results suggest that, when the target concept is a decision tree, the
expert should give equal weight to the cornerstone case and the misclassified case and
not pay special attention to the misclassified case.

† For any two examples, the chances that some attribute test above depth i will distinguish them




Fig. 6. A model decision tree of depth 10. (Left) Decreasing error rate (y-axis) with the increasing 
number of training examples (x-axis). (Right) Increasing tree size (y-axis) with the number of
training examples (x-axis). 



Another interesting finding concerns the tree size. An example of our results is
shown in Figure 6 (right). The data come from the same trials used to generate the error
rates in Figure 6 (left). The results are quite typical and show that, besides giving lower error
rates, expert 1 generates rather compact theories, while the size of the RDR tree for 
expert 2 is some 4 times greater than C4.5 trees. Once again this result corroborates the 
conclusion that the expert should view the cornerstone case and the misclassified case 
symmetrically. 

Simulation results also show that the comparative performance of RDR is better 
when there are fewer examples and when the target concept is more complex. Figure 5 
shows that expert 1 always outperforms C4.5. To determine how much RDR outper- 
forms C4.5, errors can be introduced into the simulated expert to degrade the perfor- 
mance of RDR. The degree to which RDR outperforms C4.5 can then be measured by 
the amount of error needed to make RDR give the same error rate as C4.5. The error 
model used here is to vary the threshold value in rules created by the simulated expert. 
If the correct threshold is x, an error of 5 will mean that the threshold returned has a
random value between x - 5 and x + 5. The results obtained show that the error tolerance
of RDR is far greater when there are fewer examples or when the target concept is more 
complex. Figures 7 and 8 show typical results of varying the number of examples and 
the depth of the target concept respectively. The error tolerance figures are approxima- 
tions because linear extrapolation was used between consecutive integer error values. 
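
The error model can be illustrated with a short sketch. It assumes that the perturbed threshold is drawn uniformly from the interval around the correct value, which the text does not state explicitly; the function name and seed are illustrative.

```python
import random


def perturb_threshold(x, error, rng):
    """Return a threshold drawn uniformly from [x - error, x + error]."""
    return rng.uniform(x - error, x + error)


rng = random.Random(1)
# An error of 5 around a correct threshold of 40.0.
noisy = [perturb_threshold(40.0, 5.0, rng) for _ in range(1000)]
```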

Increasing the number of conclusions (increasing the range of values a leaf node
in the target concept can take) produces an improved result for C4.5 and a slightly
deteriorating result for RDR. Typical results can be seen in Figure 9. This result for
C4.5 is surprising for three reasons. Firstly, the number of cuts and divisions to the
state space is determined by the depth of the target concept. Increasing the conclusion
number only increases the number of labels with which the divisions can be named.
Secondly, guessing becomes harder. With fewer conclusions, a guess is more likely to
produce the correct answer than when there are more conclusions. At this stage, we
suspect that the slight tendency of RDR to deteriorate is due to this factor. Finally,
when there are fewer conclusions, the target concept often receives some pruning. For
instance, consider a target concept with two conclusions. When two adjacent leaf nodes
are attributed the same class (there is a 50% chance of this happening), the last attribute
test is rendered redundant.

Examples    Error tolerance
50          > 50
100         29.0
200         16.3
500         7.2
1000        3.9
2000        1.7
5000        < 1

Fig. 7. Error tolerance of RDR when the number of training examples is varied. Target
concepts were all of depth 8 with 10 attributes available.

Concept depth    Error tolerance
7                < 1
8                1.6
9                2.1
10               4.2
11               4.9
12               6.7
13               8.5
14               11.3
15               14.6
16               20.7

Fig. 8. Error tolerance of RDR when the depth of the target concept is varied. In all
instances 10 000 training examples were given.

Fig. 9. Varying the number of classes. Target concepts of depth 7 with 12 available
attributes and 5 000 training examples.

With fewer conclusions, the regions in the state space ascribed any one particular 
conclusion will tend to be more numerous and more scattered. Any single attribute test 
is unlikely to demarcate the regions for one conclusion. In contrast, when C4.5 deals 
with many conclusions, it is more likely that C4.5 will find a split on an attribute which 
does not split apart the examples belonging to the same conclusion.

5 Conclusions



We presented a new approach to systematic studies in knowledge acquisition research 
and we believe that more studies using models of experts will allow substantial progress 
towards a better understanding of the strengths and weaknesses of a KA technique. 
While using artificial data sets is not new in machine learning research, it has not been 
widely used as a tool to obtain a better understanding of the domain characteristics 
that determine the relative success or lack of success of a given technique. We expect 
that important insights into characteristics of ML techniques can be obtained using our 
presented methodology. 

We demonstrated the usefulness of our simulations by presenting a number of counter- 
intuitive results, which were partly in contradiction to conclusions drawn from previous 
studies based on poorly understood ‘real world’ data. 

i) Despite the fact that RDR seems to have significantly more resources to draw
information from, in some domains, C4.5 achieves a lower error rate and more compact 
trees than RDR, based on the same training examples. This can be interpreted as show- 
ing the inadequacy of the used knowledge representation scheme and suggests further 
research towards enhancing the SCRDR knowledge representation. 

ii) In the hyper-rectangles study, it was demonstrated that an expert who provides
as many discriminating conditions as possible would build a more compact and more
accurate knowledge base faster.

iii) In the decision tree studies, we demonstrated that it is beneficial if the expert
does not focus so much on the misclassified case as an exception to the existing
knowledge base, but considers both cases on a rather equal footing when formulating
a discriminating condition. Furthermore, results show that RDR performs better,
when compared to C4.5, on domains with fewer examples, greater complexity or fewer
classes.

These guidelines for the experts when interacting with the system depend on the 
structure of the domain, which is in practice not exactly known. However, we believe 
that often some insight into a domain is available which may suffice to assess roughly 
what type of domain one is dealing with. For example the hyper-rectangles represent 
disjunctions of multiple interval constraints. If one is to classify objects broadly, such 
as cars into fast and slow cars, hyper-rectangles seem to be a reasonable model, where 
one can formulate, e.g. intervals in engine power depending on other features such as 
type of car, etc. 

Future research will develop in a number of directions. For knowledge acquisition 
in the style of RDR, it is desirable to represent a number of hyper-rectangles in a more 
compact form than a single RDR tree, to allow suitable treatment of the disjunctive 
nature of the domain. The effects of non-uniform distributions of examples using the 
presented models will also be of significant interest. Another interesting direction to 
push this work is to see how sensitive RDR is to errors which an expert can make in 
formulating rules. For instance, the threshold for an attribute test could be perturbed by 
some degree. This may result in larger RDR trees and higher error rates.

Abstract. Neural networks are recognised as an effective tool for predicting
stock prices (Shin & Han, 2000), but little is known about which configurations 
are best and for which indices. The present study uses genetic algorithms to find 
a near optimal learning rate, momentum, tolerance and network architecture for 
47 indices listed on the Australian Stock Exchange (ASX). Some relationships 
were determined between stock index and neural network attributes, and 
important observations were made for the further development of a 
methodology for determining optimal neural network configurations. 



Keywords: genetic algorithms, neural networks, stock forecasting.



1. Introduction 

Many attempts have been undertaken at adopting artificial intelligence methods for 
forecasting share prices on the stock market, which are starting to supersede 
traditional computational methods. For the neural network (NN) method it is not 
known how to produce a near optimal configuration. The present study uses genetic 
algorithms (GAs) to near-optimally configure NNs for particular forecasts. A 
qualitative analysis of the configurations corresponding to particular indices seeks to 
explain the factors in the GAs’ selection of that configuration. 



1.1. Genetic Algorithms 

GAs are an abstraction of the principles of genetic evolution (Kuo et al., 2001). A 
genetic algorithm is a string of binary digits (bits) that encode an algorithm. The 
fitness of that algorithm is determined by its performance on a given task, and a 
population of GAs can be evolved to improve the fitness of individual algorithms. 

The process undertaken for each generation is as follows: 

1. Determine the “fitness” of each algorithm according to a fitness function. 



M. Brooks, D. Corbett, and M. Stumptner (Eds.): AI 2001, LNAI 2256, pp. 285-296, 2001.
2. Take random pairs of the fittest algorithms out of the population and cross over
random portions to produce progeny. 

3. Mutate a small number of bits by making random changes. 

All three steps ensure the fitness improves. By only breeding the fittest algorithms, 
the aspects that make them the fittest remain. By crossing over algorithms, portions 
that contribute to fitness can be combined. By mutating bits, aspects that contribute to 
fitness can emerge. 
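
The three steps of a generation could be sketched as follows. This is an illustrative sketch only: the number of parents kept, the single crossover point, the mutation rate and the toy fitness function (counting 1-bits) are assumptions and not the settings of the study, where the fitness comes from a trained neural network.

```python
import random


def next_generation(population, fitness, rng, n_parents=10, mutation_rate=0.01):
    """One generation: rank by fitness, cross over fit parents, mutate bits."""
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[:n_parents]                  # step 1: keep the fittest
    children = []
    while len(children) < len(population):
        a, b = rng.sample(parents, 2)             # step 2: random pair of parents
        cut = rng.randrange(1, len(a))            # assumed: single crossover point
        child = a[:cut] + b[cut:]
        child = "".join(                          # step 3: flip a few random bits
            ("1" if bit == "0" else "0") if rng.random() < mutation_rate else bit
            for bit in child
        )
        children.append(child)
    return children


rng = random.Random(0)
population = ["".join(rng.choice("01") for _ in range(80)) for _ in range(20)]
toy_fitness = lambda bits: bits.count("1")        # stand-in fitness function
for _ in range(50):
    population = next_generation(population, toy_fitness, rng)
```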

1.2. Neural Networks 

Neural networks in a computing sense are an abstraction of the biology of the brain 
(Rogers, 1997). The network is made up of a number of layers. Each layer contains 
several neurons, each of which has connections to every neuron in the adjacent layer. 
Each connection has a certain weight associated with it, which is multiplied by the 
output value of the neuron at one end and forms the input for the neuron at the other 
end. The sum of the inputs, x, of a particular neuron determines the output of that
neuron according to a transfer function, f(x). It is the changes in the connection 
weights that allow the network to learn. 
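
The computation performed by a single neuron and by a fully connected layer can be sketched as below. The sigmoid is used as the transfer function f(x), as in the methodology of this study; the function names and the example weights are illustrative.

```python
import math


def transfer(x):
    """Sigmoid transfer function f(x)."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(inputs, weights):
    # Each input is the output of a neuron in the previous layer,
    # multiplied by the weight of the connecting link.
    x = sum(w * v for w, v in zip(weights, inputs))
    return transfer(x)


def layer_output(inputs, weight_rows):
    """Fully connected layer: one row of weights per neuron."""
    return [neuron_output(inputs, row) for row in weight_rows]


hidden = layer_output([1.0, 2.0], [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
```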

A back propagation neural network (BPNN) contains at least one input layer of 
neurons, one output layer, and any number of hidden layers (Figure 1). The network is
trained by comparing observed and expected output values, and adjusting the weights 
of the links according to the error. The change in weight, Δ, is equal to the product of
the initial weight, the learning rate and the error. These changes in weights propagate 
back through the network after comparing the observed and expected output, hence 
the name. 




Fig. 1. Back propagation neural network 



1.3. Financial Forecasting 

There seems to be consensus among many researchers that NNs are superior to other 
methods of stock market forecasting or, at the very least, a reliable method. Aiken & 
Bsat (1999) explain the advantage NNs have over traditional statistical models for 
financial forecasting: they are “fault tolerant,” do not “require assumptions about the 
data,” and can “deal with missing data.” Wittkemper & Steiner (1996) demonstrated 
this advantage, applying GAs to NNs and producing better predictive results than 
traditional statistical methods. 

There are several other statements that back the superiority of NNs for financial 
forecasting and their popularity amongst researchers. “The application of NNs for 




Application of Genetic Algorithms 287 



decision support is well documented” (Kumar et al., 1997). “NNs are gaining 
popularity for solving several business and technical problems that involve 
prediction” (Sexton & Gupta, 2000). “NNs are an ideal choice for flexible non-linear 
modelling and are gaining attention in the area of stock prediction” (Qi, 1999).

Despite their performance, NNs are not the perfect solution to forecasting. Wong & 
Selvi (1998) point out two limitations to neural networks: they require large data sets 
(strategic decision-making is non-routine); and they are unable to explain their 
decisions. Nevertheless, an extensive literature analysis shows that neural networks generally
outperform statistical models (Wong & Selvi, 1998). 

The availability of material relating to what data should be analysed by NNs to 
produce good results is limited. According to Shin & Han (2000), multi-resolution 
learning significantly improves the generalisation ability of NNs. Kuo et al. (2001)
were the first to suggest that the consideration of qualitative factors improves the
forecasting ability of NNs. The present study uses historical share price data. 

Researchers have given little attention to the reasoning behind their choice of NN 
configuration. However, according to Qi (1999), “It has been widely accepted that a 
three-layer feedforward network with an identity transfer function in the output units 
can approximate any continuous function arbitrarily well given sufficiently many 
middle-layer units” - an hypothesis that will be tested by this study. 



2. Methodology

Stocks from the top 50 leaders in the ASX June 2001 were used, excluding AXA Asia 
Pacific, News Corporation Preferred and NRMA Insurance Group, as data was not 
available for these indices for the period from which data was obtained. The indices 
and their corresponding ASX codes are listed in Table 1. 

For each index, 50 generations of 20 genetic algorithms were run. Each genetic 
algorithm contained 80 binary digits (bits). The first 32 bits represented the hidden 
layers of a neural network (NN), and the last 48 bits represented the learning rate (L), 
momentum term (M) and tolerance (T) for the NN. 

The first 32 bits were divided into groups of four. The decimal value of the
first three bits of each group gave the number of neurons in each hidden layer from 0
to 7. A value of 0 represented the non-existence of the corresponding layer. The 
fourth bit in each group indicated whether the number of neurons in that layer should 
be added to the number of neurons in the next layer (Figure 2). 
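
The decoding of the hidden-layer bits might look as follows. The sketch assumes that a set fourth bit carries the group's neuron count forward into the next group and that a count left over after the last group forms a final layer; the function name is hypothetical.

```python
def decode_hidden_layers(bits):
    """Decode groups of four bits into a list of hidden-layer sizes."""
    layers = []
    carry = 0
    for i in range(0, len(bits) - 3, 4):
        group = bits[i:i + 4]
        size = int(group[:3], 2) + carry   # first three bits: 0 to 7 neurons
        carry = 0
        if group[3] == "1":
            carry = size                   # fourth bit: add to the next layer
        elif size > 0:
            layers.append(size)            # a size of 0 means no such layer
    if carry > 0:
        layers.append(carry)               # assumed handling of a trailing carry
    return layers


example = decode_hidden_layers("0110100110101110")
```

For the 16 bits shown in Figure 2, the groups decode to 3, 4, 5 and 7 neurons, and the set fourth bit of the second group merges 4 and 5 into a single layer of 9.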

0110100110101110... 



3 4:5 7 

9 



Fig. 2. Conversion of GA to NN configuration

Table 1. ASX codes and names of stock indices used.

ASX Code  ASX Name                    Group    Dur.      β  ES (%)
AGL       AUSTRALIAN GAS LIGHT FPO    Ind.    10.00   0.76      75
ALL       ARISTOCRAT LEISURE FPO      Ind.    10.00   1.19      62
AMC       AMCOR LIMITED FPO           Ind.    15.00   1.72      75
AMP       AMP LIMITED FPO             B & F    3.05   0.90      89
ANZ       AUSTRALIA AND NZ FPO        B & F   18.00   0.99      80
BHP       BHP BILLITON LIMITED FPO    Res.    14.00   1.46      67
BIL       BRAMBLES INDUSTRIES FPO     Ind.    10.00   0.68      77
CBA       COMMONWEALTH BANK. FPO      B & F    9.81   0.99      80
CCL       COCA-COLA AMATIL FPO        Ind.    10.00   0.47      71
CML       COLES MYER LTD. FPO         Ind.    10.00   0.45      74
CPU       COMPUTERSHARE LTD FPO       B & F    7.10   1.11      61
CSL       CSL LIMITED FPO             Ind.     7.07   0.86      74
CSR       CSR LIMITED FPO             Ind.    20.00   1.17      54
CWO       CABLE & WIRELESS OPTUS FPO  Ind.     2.63   1.08      60
ERG       ERG LIMITED FPO             Ind.    10.00   1.08      60
FBG       FOSTER'S BREWING FPO        Ind.    10.00   0.52      86
FXJ       FAIRFAX (JOHN) FPO          Ind.    10.00   1.00      71
GMF       GOODMAN FIELDER FPO         Ind.    10.00   0.47      71
GPT       GENERAL PROP. TRUST UNIT    B & F   10.00   0.58      NA
HVN       HARVEY NORMAN FPO           Ind.    10.00   0.45      74
LLC       LEND LEASE CORP. FPO        B & F   14.00   1.12      65
MBL       MACQUARIE BANK LTD FPO      B & F    4.93   0.99      80
MGR       MIRVAC GROUP STAPLED        B & F    2.04   0.58      NA
MIM       M.I.M. HOLDINGS LTD FPO     Res.    10.00   1.42      56
NAB       NATIONAL AUST. BANK FPO     B & F   10.00   0.99      80
NCP       NEWS CORPORATION FPO        Ind.    14.00   1.00      71
NDY       NORMANDY MINING FPO         Res.    10.00   1.10      51
ORI       ORICA LIMITED FPO           Ind.     3.41   0.94      78
PBL       PUBLISHING & BROAD FPO      Ind.    10.00   1.00      71
POP       PACIFIC DUNLOP FPO          Ind.    12.00   0.70      78
QAN       QANTAS AIRWAYS FPO          Ind.     5.93   0.68      77
QBE       QBE INSURANCE GROUP FPO     B & F   10.00   0.90      89
RIO       RIO TINTO LIMITED FPO       Res.     4.08   1.46      67
SGB       ST GEORGE BANK FPO          B & F    9.01   0.99      80
SME       SUNCORP-METWAY FPO          B & F    4.15   0.99      80
SMI       SMITH (HOWARD) FPO          Ind.    10.00   0.70      78
SRP       SOUTHCORP LIMITED FPO       Ind.     7.59   0.70      78
STO       SANTOS LTD FPO              Res.    20.00   0.97      58
TAH       TABCORP HOLDINGS LTD FPO    Ind.    10.00   1.19      62
TLS       TELSTRA CORPORATION. FPO    Ind.     2.68   1.08      60
WBC       WESTPAC BANKING CORP FPO    B & F   20.00   0.99      80
WES       WESFARMERS LIMITED FPO      Ind.    10.00   0.70      78
WFT       WESTFIELD TRUST UNIT        B & F   10.00   0.58      NA
WMC       WMC LIMITED FPO             Res.    14.00   1.46      67
WOW       WOOLWORTHS LIMITED FPO      Ind.    10.00   0.45      74
WPL       WOODSIDE PETROLEUM FPO      Res.    10.00   0.97      58
WSF       WESTFIELD HOLDINGS FPO      B & F   10.00   1.12      65

Median                                       10.000  0.990    74.0
Mean                                          9.883  0.930   71.75
SD                                            4.310  0.301    9.61




The last 48 bits were divided into three groups of 16. The reciprocal of the decimal
value of each group gave L, M and T respectively.
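
Decoding the three training parameters might be sketched as follows. The text does not say how a group of all zeros is handled, so the sketch returns `None` for it; this choice and the function name are assumptions.

```python
def decode_parameters(bits):
    """Split 48 bits into three 16-bit groups and take reciprocals (L, M, T)."""
    values = []
    for i in range(0, 48, 16):
        n = int(bits[i:i + 16], 2)
        # Zero has no reciprocal; None is an assumed placeholder.
        values.append(1.0 / n if n > 0 else None)
    return tuple(values)


L, M, T = decode_parameters(
    "0000000000000010" + "0000000000000100" + "0000000000001010"
)
```

The three groups in the example have the decimal values 2, 4 and 10, which give L = 0.5, M = 0.25 and T = 0.1.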

Each NN contained 14 input neurons and 7 output neurons. The stock index data was
organised into 19 groups for the months of June 1999 to January 2001. Each group 
contained the closing price for the stocks on the 21 trading days beginning on the first 
trading day of each month. If data was not available for a particular index for a 
particular trading day, that data was filled in from the most recent previous trading 
day. 
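
The preparation of one monthly data set can be sketched as below. Filling a missing price from the most recent previous trading day follows the text; splitting the 21 days into the first 14 as network inputs and the last 7 as targets is an assumption suggested by the 14 inputs and 7 outputs, and the function names are hypothetical.

```python
def forward_fill(prices):
    """Replace missing closing prices (None) with the most recent earlier price."""
    filled, last = [], None
    for p in prices:
        if p is None:
            p = last
        filled.append(p)
        last = p
    return filled


def split_month(prices, n_inputs=14, n_outputs=7):
    """Assumed split: first 14 trading days as inputs, last 7 as targets."""
    if len(prices) != n_inputs + n_outputs:
        raise ValueError("expected 21 trading days")
    return prices[:n_inputs], prices[n_inputs:]


month = [float(i) for i in range(21)]
month[5] = None                      # a day without data
filled = forward_fill(month)
inputs, targets = split_month(filled)
```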

Because the NN software only accepts input and output values between 0 and 1, 
the stock prices were converted thus: 

$$ s'_n = \frac{s_n - \min(S)}{\max(S) - \min(S)} \qquad (1) $$
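
The conversion in (1) is a min-max scaling, sketched below. The set S over which the minimum and maximum are taken is assumed here to be the price series passed in, and mapping a constant series to zeros is an added safeguard, not something stated in the text.

```python
def normalise(prices):
    """Scale prices linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(prices), max(prices)
    if hi == lo:
        return [0.0 for _ in prices]   # assumed handling of a constant series
    return [(p - lo) / (hi - lo) for p in prices]


def denormalise(values, lo, hi):
    """Map scaled network outputs back to prices."""
    return [lo + v * (hi - lo) for v in values]


scaled = normalise([10.0, 15.0, 20.0])
```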



A sigmoid transfer function was used for the firing of neurons: 

$$ f(x) = \frac{1}{1 + e^{-x}} \qquad (2) $$

Each NN was trained using backpropagation (BP) over a maximum of 100,000
epochs using the 12 data sets from the 1999-2000 financial year. The fitness of the 
algorithm was determined as the error of the NN when applied to the 6 data sets for 
the first 6 months in the 2000-2001 financial year. The error for each data set was 
specified by the NN software as: 

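For the error over the six data sets, the negative Root Mean Squared Error (RMSE) 
was taken. The negative RMSE was used because a low error should indicate a high 
fitness, and vice-versa: 

$$ \text{fitness} = -\sqrt{\frac{1}{6} \sum_{i=1}^{6} E_i^{2}} \qquad (4) $$

where $E_i$ is the error on the $i$-th of the six data sets. 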
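
The negative-RMSE fitness described above can be sketched as below. The function name and the sample error values are hypothetical; the per-data-set errors themselves would come from the NN software.

```python
import math


def negative_rmse(errors):
    """Fitness = -(root mean squared error) over the per-data-set errors."""
    if not errors:
        raise ValueError("at least one error value is required")
    mean_square = sum(e * e for e in errors) / len(errors)
    return -math.sqrt(mean_square)


# Six made-up per-data-set errors, one per test month.
fitness = negative_rmse([0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
```

Because the sign is flipped, a configuration with smaller errors receives a larger (less negative) fitness, which is what a maximising GA needs.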
The aim was to determine a relationship between the attributes of the optimal NN 
configurations and the attributes of the stock indices. The attributes considered were 
the learning rate (L), momentum (M) and tolerance (T) chosen by the algorithm, the 
number of layers l (depth) of the network, the mean number of neurons in each layer 
(5), the accentuation (6) of the network configuration, and the left (7) and right (8) 
triangular orientations. 

$$ \bar{n} = \frac{n_1 + n_2 + \dots + n_l}{l} \qquad (5) $$

If $l = 1$ then $A = 0$. 

If $l > 1$ then 

$$ A = \frac{|n_2 - n_1| + |n_3 - n_2| + \dots + |n_l - n_{l-1}|}{\bar{n}\,(l-1)} \qquad (6) $$
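
A minimal sketch of the two architecture measures in (5) and (6) follows. It assumes that the layer-to-layer differences in (6) are taken as absolute values, and the layer sizes in the example are hypothetical rather than read from the study's data.

```python
def mean_neurons(layers):
    """Mean number of neurons per layer, as in (5)."""
    return sum(layers) / len(layers)


def accentuation(layers):
    """Accentuation A, as in (6); zero for a single-layer network."""
    depth = len(layers)
    if depth == 1:
        return 0.0
    # Assumption: differences between neighbouring layers are absolute.
    total_change = sum(abs(b - a) for a, b in zip(layers, layers[1:]))
    return total_change / (mean_neurons(layers) * (depth - 1))


example_layers = [12, 20]  # hypothetical two-layer configuration
```

For this example the mean is 16 neurons per layer and the accentuation is 8 / 16 = 0.5.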



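If $l = 1$ then $TL = 0$. 

If $l > 1$ then 

$$ TL = \left(n_1 - \frac{n_l}{\ldots}\right) + \left(n_2 - \frac{2 n_l}{\ldots}\right) + \dots + \left(n_{l-1} - \frac{(l-1)\, n_l}{\ldots}\right) \qquad (7) $$

If $l = 1$ then $TR = 0$. 

If $l > 1$ then 

$$ TR = \left(n_l - \ldots\right) + \dots + \left(n_3 - \ldots\right) + \left(n_2 - \ldots\right) \qquad (8) $$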



The nearest-optimal NN, determined by the fitness function, was selected for each 
index. Its performance on forecasting the July 2001 data set was compared with the 
performance of another, arbitrarily chosen algorithm-generated NN. 

For each attribute, the optimal NN configurations were classified as being “high” 
or “low” for that attribute according to whether they had a value for that attribute 
above (or equal to) or below the median respectively. The low and high values were 
reversed for TL and TR, as a low value implies a high triangular orientation and vice 
versa. 
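
The median split described above can be sketched as follows. The function name and the `reverse` flag are assumptions of this example; the three learning rates are the Table 3 values for AGL, ALL and AMC, but the median used here is that of the three-value sample, not of the full data set.

```python
from statistics import median


def split_by_median(values, reverse=False):
    """Label each item HIGH (at or above the median) or LOW (below it).

    With reverse=True the labels are swapped, as described for TL and TR.
    """
    cut = median(values.values())
    labels = {}
    for name, value in values.items():
        is_high = value >= cut
        if reverse:
            is_high = not is_high
        labels[name] = "HIGH" if is_high else "LOW"
    return labels


learning_rates = {"AGL": 0.75, "ALL": 0.11, "AMC": 0.68}
labels = split_by_median(learning_rates)
```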

This categorisation was compared against attributes of the stock indices, shown in 
Table 1. The attributes used for the stock indices were market sector grouping 
(Industrials, Resources or Building & Finance), approximate duration in years (high 
or low) up to July 2001, sector β-volatility in July 2001 (high or low), and sector 
earnings stability (high or low). As with the NN attributes, categorisation into high 
and low values for index attributes was determined by the median. 

The indices covered 22 market sectors; however, no single sector contained enough 
indices to support a reasonable analysis: the largest, Banks & Finance, covered only 
seven. The sectors were therefore combined into three groupings. The Building & 
Finance grouping included Banks & Finance, Developers & Contractors, Insurance, 
Investment & Financial Services and Property Trusts. The Industrials grouping 
included Alcohol & Tobacco, Building Materials, Chemicals, Diversified Industrials, 
Food & Household, Healthcare & Biotech, Infrastructure Utilities, Media, Paper & 
Packaging, Retail, Transport, Telecommunications and Tourism & Leisure. The 
Resources grouping included Diversified Resources, Energy, Gold and Other Metals. 



Results 

The difference in fitness of random networks and the near-optimal networks tested on 
the July 2001 data set for each share index (Table 2) was statistically significant (p < 
10*). A paired t-test with one tail was used to take into consideration the one to one 
correspondence of data elements based on the index to which they are applied. This 
result suggests some significance in the GA’s choice of NN configuration as opposed 
to the chosen NN having a similar fitness to a randomly generated NN. In other 
words, the GA had to some extent optimised the configuration of the NN. 

The optimised NNs and corresponding stock indices were categorised as HIGH or 
LOW for each attribute. This was done according to the median. The exception was 
the sector group attribute of the stock indices, which was divided into Building & 
Finance, Industrials and Resources. The values for the index attributes are listed in 
Table 1, whereas the values for the NN attributes are listed in Table 3. 

Table 4 shows the frequency of stock indices falling into particular categories for 
both NN and stock index attributes. For the purposes of this analysis, it is expected 
that particular attributes of stock indices lead to particular attributes of optimised 
NNs. Therefore a χ² test on the category frequencies for each NN attribute and each 
stock index attribute category should show whether the latter affects the former. The 
expected frequencies for the test are taken as the total frequency of the NN attribute 
category times the total frequency of the index attribute category, divided by the total 
number of indices. The total number of indices was 47 except in the case of Earnings 
Stability, where there was no data for three indices in the Property Trusts sector. 
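
The expected-frequency rule stated above (row total times column total, divided by the number of indices) can be sketched as below. The observed counts are the Learning Rate rows of Table 4 against the three sector groupings. Computing a Pearson chi-squared statistic from them is an assumption of this example, since that is the standard test built on this rule; the function names are hypothetical.

```python
def expected_frequencies(observed):
    """Expected count for each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]


def chi_squared_statistic(observed):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_frequencies(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Rows: Learning Rate HIGH, LOW. Columns: B & F, Industrials, Resources.
observed = [[4, 14, 5], [11, 11, 2]]
expected = expected_frequencies(observed)
statistic = chi_squared_statistic(observed)
```

Here the expected count for a LOW learning rate in the Building & Finance grouping is 24 × 15 / 47, against an observed count of 11.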

The most significant results (p < 0.3) are listed in Table 5. Only two 
comparisons were statistically significant with p < 0.1 and another two with 
0.1 < p < 0.2. 

Table 2. Performance of random and near-optimal NN configurations on each stock index. 

ASX code    Random   Optimised
AGL         -1.244      -0.479
ALL         -1.158      -2.013
AMC         -0.439      -1.067
AMP         -0.834      -0.675
ANZ         -1.077      -0.600
BHP         -1.266      -1.077
BIL         -1.560      -1.447
CBA         -0.925      -0.752
CCL         -1.262      -0.097
CML         -1.497      -0.110
CPU         -1.240      -0.868

Table 3. Attributes of optimised NN configurations for each stock index. 



ASX code         L       M       T       l       n       A      TL      TR
AGL           0.75    0.56    0.45       2   16.00    0.50    0.10    1.17
ALL           0.11    0.37    0.04       4    3.75    0.98    0.67    0.75
AMC           0.68    0.93    0.01       1   24.00    0.00    0.00    0.00
AMP           0.77    0.99    0.40       3   13.33    0.41    0.42    1.86
ANZ           0.77    0.99    0.39       4    6.50    0.77    0.65   -0.05
BHP           0.16    0.91    0.09       3   11.33    1.94    2.00    7.50
BIL           0.13    0.19    0.10       4    7.25    1.47    2.64    8.75
CBA           0.00    0.43    0.31       4    9.50    0.46    0.08    1.19
CCL           0.98    0.75    0.39       5    8.80    0.80   -0.04    1.30
CML           0.92    0.78    0.06       6    6.00    0.83   -0.02    0.30
CPU           0.09    0.98    0.47       4    7.00    1.10    0.13   -0.14
CSL           0.52    0.96    0.59       6    5.00    1.04    0.38    0.60
CSR           0.81    0.98    0.60       2   11.00    1.45   -0.34    5.83
CWO           0.96    0.95    0.25       1   42.00    0.00    0.00    0.00
ERG           0.21    0.15    0.73       3   10.33    2.71   14.67   14.50
FBG           0.56    0.66    0.54       5    4.20    0.77    0.60   -0.38
FXJ           0.98    0.71    0.09       3   13.67    1.57    4.29    1.28
GMF           0.71    0.59    0.21       3    6.33    1.26   -0.27    8.50
GPT           0.52    0.27    0.09       4    3.25    0.82    0.50   -0.19
HVN           0.30    0.48    0.11       6    8.00    0.53    2.03   -0.39
LLC           0.14    0.09    0.02       2   18.00    1.78   -0.44   16.50
MBL           0.21    0.35    0.41       4    6.00    0.56    0.31    0.68
MGR           0.00    0.82    0.29       4    4.25    0.55    5.08   -0.21
MIM           0.99    0.97    0.25       6    5.67    0.99   -0.03    2.50
NAB           0.36    0.97    0.51       6    7.00    0.77    0.10    0.78
NCP           0.42    0.98    0.20       5    4.00    0.38    0.26    4.10
NDY           0.98    0.75    0.39       4    4.50    0.59    0.17    0.58
ORI           0.99    0.82    0.09       1   44.00    0.00    0.00    0.00
PBL           0.22    0.36    0.75       4    8.00    0.63    0.24    1.75
POP           0.49    0.72    0.06       5    8.20    0.58    1.26   -0.30
QAN           0.95    0.85    0.22       5    9.00    0.81    0.09    0.98
QBE           0.21    0.69    0.43       6    6.00    0.57    0.15    1.50
RIO           0.90    0.99    0.03       6    6.67    0.63    0.17    1.10
SGB           0.06    0.30    0.27       3    8.33    1.14   -0.24    5.25
SME           0.79    0.94    0.56       1   36.00    0.00    0.00    0.00
SMI           1.00    0.88    0.17       1   51.00    0.00    0.00    0.00
SRP           0.99    0.35    0.25       3   13.67    1.35   -0.29   19.50
STO           0.95    0.94    0.70       1   23.00    0.00    0.00    0.00
TAH           0.91    0.89    0.85       1   44.00    0.00    0.00    0.00
TLS           0.92    0.97    0.39       5    7.00    0.71   -0.01    1.29
WBC           0.00    0.12    0.30       4    3.50    1.14   -0.06    3.75
WES           0.97    0.99    0.99       1   41.00    0.00    0.00    0.00
WFT           0.10    0.70    0.35       2   13.00    1.08   -0.20    2.83
WMC           0.87    0.87    0.09       5    7.00    0.79    0.01    0.85
WOW           0.65    0.11    0.04       4    6.00    0.11    1.02    0.23
WPL           0.51    0.99    0.39       5    5.80    0.65    0.36    0.31
WSF           0.77    0.99    0.38       5    4.40    0.91    0.24    1.85

Median       0.685   0.821   0.296   4.000   8.000   0.769   0.100   0.850



Table 4. Frequency of NN attributes for each category in stock index attributes. 

                     Group              Duration     β-volatility    Earn. Stab.
            B & F   Ind.   Res.      HIGH    LOW     HIGH    LOW     HIGH    LOW
L    HIGH      4     14      5        15      8       12     11       10     13
L    LOW      11     11      2        18      6       13     11       13      8
M    HIGH      6     11      6        14      9       15      8       10     13
M    LOW       9     14      1        19      5       10     14       13      8
T    HIGH     10     10      3        16      7       13     10       11     11
T    LOW       5     15      4        17      7       12     12       12     10
l    HIGH      3      9      4        12      4        7      9        8      8
l    LOW      12     16      3        21     10       18     13       15     13
n    HIGH      6     14      2        14      8       11     11       12      9
n    LOW       9     11      5        19      6       14     11       11     12
A    HIGH      8     12      3        18      5       13     10        9     12
A    LOW       7     13      4        15      9       12     12       14      9
TL   HIGH      8             4        17      6       12     11       10     11
TL   LOW       7             3        16      8       13     11       13     10
TR   HIGH      8     12      3        16      7       14      9        9     13
TR   LOW       7     13      4        17      7       11     13       14      8


Table 5. Comparisons of index attribute categories and NN attributes by χ² test with p < 0.3. 

Index Attribute      Index Attribute Category   NN Attribute       NN Attribute Value   p
Group                Resources                  Momentum           HIGH                 0.0515
Group                Building & Finance         Learning Rate      LOW                  0.0844
Group                Building & Finance         Tolerance          HIGH                 0.1695
Group                Resources                  Depth              HIGH                 0.1971
Group                Resources                  Learning Rate      HIGH                 0.2339
Earnings Stability   LOW                        Learning Rate      HIGH                 0.2343
Earnings Stability   LOW                        Momentum           HIGH                 0.2343
Earnings Stability   LOW                        Right Triangular   HIGH                 0.2343
β-volatility         LOW                        Momentum           LOW                  0.2381
Duration             LOW                        Momentum           HIGH                 0.2506
Group                Building & Finance         Depth              LOW                  0.2511
β-volatility         HIGH                       Momentum           HIGH                 0.2685

Discussion 

It has been demonstrated that configuration is an important consideration when 
forecasting stock prices with NNs. The present study has not, however, completely 
produced all the useful rules for configuring a NN for stock market forecasting based 
on the attributes of the stock index. Despite the results, the present study has managed 
to “test the water”, so to speak, in developing a methodology to determine the causes of 
NN optimisation. 

The apparent lack of strength in the relationship between the NN attributes and 
some stock index attributes possibly suggests that the given NN attributes are not 
strongly determined by those index attributes. To find stronger relationships, more 
attributes should be considered in any future study. 

It is interesting to note that the five most significant relationships listed in Table 5 
were determined by market sector groupings, suggesting that this attribute is a 
stronger determinant than the others. Therefore another possible contribution to the 
lack of strength in these relationships is that the other three attributes are dynamic, 
and were taken from a particular point in time. Unfortunately, no archived data were 
available for previous measures of β-volatility or earnings stability. 

Also, the present study does not take into consideration whether there is some kind 
of interrelationship between the attributes of the indices or between the attributes of 
the NNs. Given the possibility of this consideration, there would be potential value in 
conducting a more detailed analysis. A proposed method of analysis that takes this 
consideration into account would be some form of machine learning. 

The categorisations of some attributes into HIGH and LOW do not allow an 
examination of the variations within these groupings. There would be more 
qualitative integrity to the analysis if a less simplified approach were taken. For three 
of the stock index attributes, relationships between these and NN attributes could be 
determined in terms of correlation. This unfortunately does not apply to market 
sectors, as there is no apparent translation from these categories to quantitative values. 

Training values such as momentum, learning rate and tolerance featured more 
prominently in significant relationships, suggesting that these attributes should be 
examined separately to those that determine attributes of the network itself. 

A casual observation of the GAs at work would show that in many cases the fittest 
generation was towards the earlier or middle generations. This suggests a high 
prevalence of local minima, which could be caused by the convolution of network 
architecture fitness by training data fitness, and possibly furthers the case for 
interrelationships between such attributes. Nonetheless, there is little to be gained 
from studying such convoluted data until the training values and network 
architectures are examined as separate considerations. 

In conclusion, there are several limiting factors to determining strong relationships 
between index and NN attributes: the lack of importance of the index attributes 
studied in determining NN configurations, dynamic attributes being considered as 
static attributes, the lack of consideration of interrelationships between index 
attributes or NN attributes, and the integration of training values with NN architecture. 

It is suggested that any future studies into the optimisation of neural network 
configuration for stock market forecasting: 

a) separate the optimisation of NN training values and architectures; 

b) use different stock index attributes to find network configuration determinants; 

c) conduct a more detailed analysis of the relationship between stock index and 
NN attributes, possibly taking into account attribute interrelationships, and 
examining relationships by correlation, not frequency. 

Further study will take these matters into consideration and refine the 
methodology. 


Abstract. Belief update is usually defined by means of operators acting 
on belief sets. We propose here belief update operators acting on 
epistemic states which convey much more information than belief sets 
since they express the relative plausibilities of the pieces of information 
believed by the agent. In the following, epistemic states are encoded as 
rankings on worlds. We extend a class of update operators (dependency-based 
updates) to epistemic states, by defining an operation playing the 
same role as knowledge transmutations [21] do for belief revision. 



1 Introduction 

While belief revision is meant to integrate new knowledge about a static world, 
belief update is usually thought of as taking account of a piece of information 
representing the effect of an evolution of the world (which may be caused by an 
event or an action) [13]. It has been shown in many places (e.g., [2] [5]) that 
iterated applications of belief revision operations need a representation of initial 
beliefs more informative than flat belief sets, namely, epistemic states. A (flat) 
belief set is a closed logical theory, which, when the language is propositional 
and generated by a finite number of propositional symbols (which is assumed 
here), is equivalently expressed by a propositional formula. An epistemic state 
is a full encoding of what the agent believes and how she is likely to revise her 
beliefs accordingly, which calls for a gradation of beliefs. This gradation is usually 
expressed by a preorder on formulas, i.e., a reflexive and transitive relation, or 
more specifically by a ranking function on formulas. 

Oddly enough, while the distinctions between revision and update have been 
extensively studied, as well as postulates and strategies for iterated belief 
revision, the KR community has devoted much less attention to iterated belief 
update (exceptions being [9], [17] and [19]) and even less to update on epistemic 
states. This raises the following questions, in order. 

1. Is iterated belief update as worth investigating as iterated belief revision? 

2. Can usual update operators, mapping a pair (belief set, input formula) to a 
belief set, be applied iteratively without trivialization? 

3. If so, does iteration sometimes need belief update operators acting on epistemic 
states rather than on belief sets? 



M. Brooks, D. Corbett, and M. Stumptner (Eds.): AI 2001, LNAI 2256, pp. 297-308, 2001. 
© Springer-Verlag Berlin Heidelberg 2001 

The answer to Question 1 is obvious when one thinks of a belief update 
operator as a tool for computing the effects of an action given the initial beliefs 
of the agent. Actions are meant to be performed in sequence, especially when it 
comes to planning, and in this context, iterated updates naturally arise. 

Question 2. is more complex. We propose the following answer. The process 
of belief revision acting on belief sets is not Markovian, because revising a belief 
set by an input formula will not lead to the same result whether it comes from 
a certain sequence of revision or from another one. This can be overcome either 
by storing the whole history of the revision process or by representing knowledge 
by epistemic states, which actually amounts to the same kind of information (see 
[14] for a general discussion), namely, not only the pure beliefs are stored but also 
the way they should be revised; now, belief revision acting on epistemic states or 
on sequences of belief sets can be seen as Markovian. That belief revision on belief 
sets should not be a Markovian process is not surprising, since belief revision is 
concerned with a static world and “old” beliefs still play an important role since 
they bear on the very same world as new ones. The latter intuition does not 
carry on to belief update, or at least not with the same strength. Indeed, since 
iterated belief updates amount to performing successive actions, the obtained 
belief states represent the beliefs after each action is performed, and considering 
this process as Markovian is generally harmless. 

Now, the paper could well stop at this point, since updating epistemic states 
could be seen as a formal exercise with no other interest than making students 
work with worlds, formulas, rankings and so on. As the reader expects, this is 
not the case, and here is our answer to Question 3: there are some contexts where 
belief update on flat belief sets is insufficient. We give here two such contexts: 

Context 1: Successive applications of belief revisions and updates: Planning 
with nondeterministic actions in partially observable environments calls for plans 
that interleave “traditional” (or ontic) actions acting on the world only and 
knowledge-gathering (or epistemic) actions that do not change the state of the 
world, but the beliefs of the agent, only - their role is to render the agent in- 
formed enough so as to help her choose what to do next. Therefore, it may well 
be the case that a belief update will be followed by a revision, or even a sequence 
of revisions, which obviously calls for the need of working on rich structures such 
as epistemic states rather than on flat belief sets. 

Example 1 (Saturday night shooting). 

Bill is a good shooter but he is sometimes drunk. When he is not drunk, shooting 
at a turkey results in the turkey being dead. When he is drunk, however, shooting 
at a turkey results in the turkey hiding. Today, Bill does not look drunk, so 
that in the initial belief it is more plausible, yet not totally certain, that he’s not 
drunk; the turkey is initially alive and not hidden (and these latter beliefs are 
certain). Bill shoots, which may be expressed by updating the initial belief by the 
formula $(drunk \wedge hidden) \vee (\neg drunk \wedge \neg alive)$, with the further constraint that 
$drunk$ (as well as $\neg drunk$) cannot be changed by the action of shooting. After a 
few seconds, one hears the turkey gobbling, which leads to a revision by $alive$. 
Is the turkey hidden in the final state? 

With belief change operators on flat belief sets, the initial belief is 
$alive \wedge \neg hidden \wedge \neg drunk$, which after update according e.g. to Forbus’ operator 
leads to the belief set $\neg drunk \wedge \neg hidden \wedge \neg alive$; after revision by $alive$, the 
new belief set is $\neg drunk \wedge \neg hidden \wedge alive$ while the intended result is 
$drunk \wedge hidden \wedge alive$. 

Context 2: Evaluation of the satisfaction of a goal after a sequence of updates. 
After a sequence of updates, we may want to evaluate to what point a given goal 
is satisfied (this is typically looked for in decision-theoretic planning). What we 
want at the end is an epistemic state where the worlds violating the goals are 
the least entrenched ones, which needs of course to work on epistemic states. 
Consider the same example as above, with the goal of having the turkey dead at 
the end, and suppose that we have also the action bomb always resulting in the 
turkey being dead; thus performing bomb amounts to updating by $\neg alive$. We 
should be able to conclude that the plan shoot normally succeeds but sometimes 
fails, while the plan bomb always succeeds. 

We choose to model epistemic states by ranking functions on the set of 
propositional worlds, or Ordinal Conditional Functions (OCF), sometimes called 
kappa functions. This model is among the simplest ones, and it is frequently 
chosen for modelling epistemic states. For computational efficiency reasons, 
OCFs will not be represented explicitly but in a more compact way, namely, by 
means of stratified belief bases that induce a full OCF (see for instance [22]). 

In Section 2, we give the necessary background about OCFs and stratified 
belief bases; next, we give the necessary background about dependency-based 
update. In Section 3, we show how epistemic states are updated, not only 
by single propositional formulas but more generally by pairs consisting of a 
propositional formula and a rank, and we thus propose a counterpart to belief 
update of what is known under the name of transmutation for belief revision. 
We proceed first by extending the notion of variable forgetting to epistemic 
states, and then we are in a position to define transmutations for belief update. 
We briefly show that our update operators on epistemic states are relevant for 
reasoning about action, and we conclude by discussing related work. 

2 Background and notations 

Let $VAR$ be a finite set of propositional variables and $\mathcal{L}_{VAR}$ the propositional 
language built upon these variables and the usual connectives. For every $X \subseteq VAR$, 
$\mathcal{L}_X$ denotes the sublanguage of $\mathcal{L}_{VAR}$ generated from the variables of $X$ 
only. For every formula $\varphi$ of $\mathcal{L}_{VAR}$, $Var(\varphi)$ is the set of variables occurring in 
$\varphi$. $\varphi_{x \leftarrow 1}$ (resp. $\varphi_{x \leftarrow 0}$) denotes the formula from $\mathcal{L}_{VAR}$ obtained by substituting 
in a uniform way the variable $x \in VAR$ by the boolean constant $\top$ (resp. $\bot$) in 
$\varphi$. Full instantiations of variables of $VAR$ are called worlds, and are denoted by 
$\omega$, $\omega'$ etc. Full instantiations of variables of $X \subseteq VAR$ are called $X$-worlds, and 
are denoted by $\omega_X$, $\omega'_X$ etc.; $2^X$ denotes the set of all possible truth assignments 
of variables of $X$. $Mod(\varphi)$ is the set of models of $\varphi$, i.e., the worlds satisfying $\varphi$. 

Let $X$ and $Y$ be two disjoint subsets of $VAR$, and let $\omega_X \in 2^X$, $\omega_Y \in 2^Y$. We 
define the concatenation of $\omega_X$ and $\omega_Y$ as the world $\omega_X \cdot \omega_Y \in 2^{X \cup Y}$ assigning 
to each variable of $X$ (resp. $Y$) the same value as $\omega_X$ (resp. $\omega_Y$). If $\omega$ is a world 
from $2^{VAR}$ and $x \in VAR$ then we define $switch(\omega, x)$ as the world obtained 
from $\omega$ by switching the truth value of the variable $x$. If $X \subseteq VAR$, we say that 
$\omega$ and $\omega'$ agree on $X$, denoted by $\omega =_X \omega'$, if and only if $\omega$ and $\omega'$ assign the 
same truth value to every variable of $X$. 

2.1 Ordinal conditional functions and stratified belief bases 

Definition 1 

- An ordinal conditional function (OCF) $r$ is a mapping from $2^{VAR}$ to $\mathbb{N} \cup \{\infty\}$; 
$r$ is said to be normalized if and only if $\exists \omega \in 2^{VAR}$ such that $r(\omega) = 0$; 

- A normalized OCF $r$ induces an entrenchment ranking $E_r$ on $\mathcal{L}_{VAR}$ defined 
by $E_r(\varphi) = \min_{\omega \models \neg\varphi} r(\omega)$; 

- If $r$ and $r'$ are two normalized OCFs, $r$ is said to be at least as specific as $r'$, 
noted $r \geq r'$, if and only if for every world $\omega \in 2^{VAR}$ we have $r(\omega) \geq r'(\omega)$. 

Unless the contrary is explicitly stated, all OCFs considered in this paper will 
be normalized. 
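
As a small illustration of the entrenchment ranking of Definition 1, the sketch below computes the least rank among the worlds that falsify a formula, over an explicitly enumerated set of worlds. Formulas are encoded as Python predicates on worlds, the OCF `r` is a made-up example, and returning infinity for a tautology (no falsifying world) is a convention assumed here rather than one stated in the text.

```python
import math
from itertools import product


def worlds(variables):
    """Enumerate every truth assignment over the given variables."""
    for bits in product([False, True], repeat=len(variables)):
        yield dict(zip(variables, bits))


def entrenchment(r, phi, variables):
    """Least rank r(w) among the worlds w that falsify phi."""
    ranks = [r(w) for w in worlds(variables) if not phi(w)]
    # Assumed convention: a tautology has no falsifying world.
    return min(ranks) if ranks else math.inf


VARS = ("a", "b")


def r(w):
    """A made-up normalized OCF: a-and-b worlds are the most plausible."""
    if w["a"] and w["b"]:
        return 0
    return 1 if w["a"] else 2
```

With this `r`, the formula a has entrenchment 2, b has entrenchment 1, and ¬a has entrenchment 0, i.e. ¬a is not believed at all.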

The higher $r(\omega)$, the less plausible it is that $\omega$ represents the actual state of the world. 
In particular, if $r(\omega) = \infty$ then $\omega$ is totally impossible. The usual interpretation 
of OCFs is in terms of order of magnitude of infinitesimal probabilities [20]: 
$r(\omega) = i < \infty$ means that the order of magnitude of the probability of $\omega$ being 
the actual world is in $O(\varepsilon^i)$ where $\varepsilon$ is an infinitesimal, and $r(\omega) = \infty$ if $\omega$ is 
an impossible world. This interpretation implies that $r$ should be necessarily 
normalized. 

From a practical point of view, it is not possible to ask the agent to express 
her beliefs under the form of a full OCF explicitly, since it is exponentially large 
in the number of propositional variables. Instead, it is more efficient and natural 
to represent them implicitly by means of stratified belief bases. 

Definition 2 (stratified belief bases) 

- A stratified belief base (SBB) $B$ is a finite sequence $\langle B_1, \dots, B_n, B_\infty \rangle$ of 
propositional formulae $B_i$. Each $i$ in $\{1, \dots, n, \infty\}$ is called a rank. $B_\infty$ represents 
fully certain beliefs, $B_n$ the most entrenched among the uncertain beliefs and 
$B_1$ the least entrenched ones; 

- The cut of level $i$ of a SBB $B$ is defined by $Cut(B, i) = \bigwedge_{j \geq i} B_j$; $B$ is said 
to be consistent if and only if $Cut(B, 1)$ is consistent; 

- The OCF $r_B$ induced by the SBB $B$ is defined by $\forall \omega \in 2^{VAR}$, $r_B(\omega) = 
\max\{i \mid \omega \models \neg B_i\}$ if such an index exists, and 0 otherwise; $r_B$ is normalized 
if and only if $B$ is consistent. 

- If $B$ is a SBB, $\varphi$ a formula and $i$ a rank in $\{1, \dots, n, \infty\}$ then we let 
$Add(B, \varphi, i) = \langle B_1, \dots, B_{i-1}, B_i \wedge \varphi, B_{i+1}, \dots, B_\infty \rangle$. 
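
The constructions of Definition 2 can be sketched as follows. A stratified belief base is encoded as a dictionary from ranks (with `math.inf` standing for the rank ∞) to Python predicates on worlds; this encoding, the function names and the three-formula base used as an example are assumptions made for illustration only.

```python
import math
from itertools import product


def worlds(variables):
    """Enumerate every truth assignment over the given variables."""
    for bits in product([False, True], repeat=len(variables)):
        yield dict(zip(variables, bits))


def induced_ocf(base, world):
    """Highest rank whose formula the world violates, or 0 if none."""
    violated = [rank for rank, formula in base.items() if not formula(world)]
    return max(violated) if violated else 0


def cut(base, level):
    """Conjunction of all formulas of rank at least `level`."""
    members = [formula for rank, formula in base.items() if rank >= level]
    return lambda w: all(formula(w) for formula in members)


def is_consistent(base, variables):
    """A base is consistent when its cut of level 1 has a model."""
    whole_base = cut(base, 1)
    return any(whole_base(w) for w in worlds(variables))


VARS = ("a", "b", "d")
# Hypothetical base: B_1 = b, B_2 = (b -> d), B_inf = (a -> b).
BASE = {
    1: lambda w: w["b"],
    2: lambda w: (not w["b"]) or w["d"],
    math.inf: lambda w: (not w["a"]) or w["b"],
}
```

In this base a world satisfying a ∧ ¬b gets rank ∞, a world satisfying b ∧ ¬d gets rank 2, a world satisfying ¬a ∧ ¬b gets rank 1, and a world satisfying ¬a ∧ b ∧ d gets rank 0.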
2.2 Formula-variable independence and variable forgetting 

Definition 3 (FV-independence) [16] Let $\varphi$ be a formula from $\mathcal{L}_{VAR}$ and 
$X \subseteq VAR$. $\varphi$ is said to be independent from $X$ if and only if there exists a 
formula $\psi$ from $\mathcal{L}_{VAR}$ logically equivalent to $\varphi$ which does not mention any 
variable from $X$. 

It is shown in [16] that $\varphi$ is independent from $X$ if and only if $\varphi$ is independent 
from $\{x\}$ for each $x \in X$; we denote by $DepVar(\varphi)$ the set of variables $\varphi$ is 
dependent on. It is also shown in [16] that $\varphi$ is independent from $x$ if and only 
if $\varphi_{x \leftarrow 0}$ and $\varphi_{x \leftarrow 1}$ are logically equivalent, from which it can be derived that 
checking whether $x \in DepVar(\varphi)$ is coNP-complete. 
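
The equivalence test just stated can be sketched by brute force: a formula is independent from x when setting x to true or to false never changes its truth value. Formulas are encoded as Python predicates on worlds, and the enumeration is exponential in the number of variables, in line with the hardness result above. All names are hypothetical.

```python
from itertools import product


def worlds(variables):
    """Enumerate every truth assignment over the given variables."""
    for bits in product([False, True], repeat=len(variables)):
        yield dict(zip(variables, bits))


def is_independent(phi, x, variables):
    """True when phi with x set to true is equivalent to phi with x set to false."""
    return all(
        phi({**w, x: True}) == phi({**w, x: False}) for w in worlds(variables)
    )


def dep_var(phi, variables):
    """The set of variables phi semantically depends on."""
    return {x for x in variables if not is_independent(phi, x, variables)}


VARS = ("a", "b")
# (a and b) or (a and not b) mentions b but is equivalent to a.
PHI = lambda w: (w["a"] and w["b"]) or (w["a"] and not w["b"])
```

The example formula mentions b syntactically, yet its set of dependent variables is only {a}.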

The notion of variable elimination (also referred to as forgetting, projection 
or marginalization) is central in the following: 

Definition 4 (variable forgetting) [18] Let $\varphi$ be a formula from $\mathcal{L}_{VAR}$ and 
$X \subseteq VAR$. $Forget(\varphi, X)$ is the formula inductively defined as follows: 

(i) $Forget(\varphi, \emptyset) = \varphi$; 

(ii) $Forget(\varphi, \{x\}) = \varphi_{x \leftarrow 1} \vee \varphi_{x \leftarrow 0}$; 

(iii) $Forget(\varphi, \{x\} \cup X) = Forget(Forget(\varphi, X), \{x\})$. 

The following characterization of variable forgetting [15] helps to understand 
how it works in practice: if $\varphi$ is under DNF, i.e., $\varphi = \gamma_1 \vee \dots \vee \gamma_k$ where each 
$\gamma_i$ is a conjunction of literals, then $Forget(\varphi, X)$ can be obtained by deleting 
from the $\gamma_i$'s all occurrences of literals $x$, $\neg x$ for all $x \in X$. For instance, let 
$\varphi = (\neg a \vee b) \wedge (a \vee c) \wedge (b \vee c \vee d)$ and $X = \{a, d\}$. Since $\varphi$ is logically equivalent 
to $(\neg a \wedge c) \vee (a \wedge b) \vee (b \wedge c)$, we have $Forget(\varphi, X) = (b \vee c)$. $Forget(\varphi, X)$ is 
the strongest consequence of $\varphi$ being independent from $X$ [16]. 
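
The worked example above can be checked mechanically. The sketch below implements forgetting semantically, following clauses (i) to (iii) of Definition 4, with formulas encoded as Python predicates on worlds; the encoding and the names are assumptions of this example.

```python
from itertools import product


def worlds(variables):
    """Enumerate every truth assignment over the given variables."""
    for bits in product([False, True], repeat=len(variables)):
        yield dict(zip(variables, bits))


def forget(phi, xs):
    """Forget(phi, X): for each x in X, take phi[x<-1] or phi[x<-0]."""
    result = phi
    for x in xs:
        # Default arguments freeze the current formula and variable.
        def step(w, prev=result, x=x):
            return prev({**w, x: True}) or prev({**w, x: False})
        result = step
    return result


VARS = ("a", "b", "c", "d")
# phi = (not a or b) and (a or c) and (b or c or d)
PHI = lambda w: ((not w["a"]) or w["b"]) and (w["a"] or w["c"]) and (
    w["b"] or w["c"] or w["d"]
)
FORGOTTEN = forget(PHI, ["a", "d"])
```

Comparing `FORGOTTEN` with b ∨ c on all sixteen worlds confirms that the two formulas have the same models.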

2.3 Belief update 

A belief update operator $\diamond$ maps the propositional belief base (a formula) $K$ 
representing the initial beliefs of a given agent and an input formula $\alpha$ reflecting 
some explicit evolution of the world [13], to a new set of beliefs $K \diamond \alpha$ held by 
the agent after this evolution has taken place. 

Katsuno and Mendelzon [13] proposed a general semantics for update. The 
most prominent feature of KM-updates (distinguishing updates from revision) 
is that update must be performed modelwise, i.e., $Mod(K \diamond \alpha) = \bigcup_{\omega \models K} \omega \diamond \alpha$. 

Given that updates are performed modelwise, what remains to be defined is the 
way models are updated, i.e., how $\omega \diamond \alpha$ is defined. 

Update operators proposed in the literature can be (roughly) classified in 
two main families. Minimisation-based updates $\diamond_{Min}$ (such as Winslett’s PMA 
[23]), stemming from the direct instantiation of the Katsuno-Mendelzon semantics, 
compute $\omega \diamond_{Min} \alpha$ by selecting the models of $\alpha$ “closest” to $\omega$ (this notion of 
closeness being modelled by a collection of preorders on $2^{VAR}$). Dependency-based 
updates $\diamond_{Dep}$ ([10], [11], [6], [24], [12]) compute $\omega \diamond \alpha$ by first forgetting (from 
$\omega$) the truth value of all variables that are “relevant” to $\alpha$ (leaving unchanged the 
truth value of variables not relevant to the update), and then expanding the 
result with $\alpha$; the notion of “being relevant to” is modelled by a mapping $Dep$ from 
$\mathcal{L}_{VAR}$ to $2^{VAR}$. Many choices for $Dep$ are possible (see [11] for details). The most 
frequent choice for $Dep$ is semantical dependence: $Dep(\alpha) = DepVar(\alpha)$, and 
by default we let $Dep = DepVar$. Whatever the choice of $Dep$, the dependence-based 
update $\omega \diamond_{Dep} \alpha$ of a world $\omega$ by a formula $\alpha$ w.r.t. $Dep$ is the set of all 
worlds $\omega'$ such that $\omega' \models \alpha$, and for every propositional variable $x$ from $VAR$ 
such that $x \notin Dep(\alpha)$, $\omega$ and $\omega'$ assign the same truth value to $x$. 

Interestingly, $\diamond_{Dep}$ operators can be characterized through the notion of 
variable forgetting defined above. Indeed, the following holds [6]: 

$$ K \diamond_{Dep} \alpha = Forget(K, Dep(\alpha)) \wedge \alpha $$

This result gives an intuitive understanding of how dependency-based update 
works: first, one forgets the variables concerned by the update, and then one 
expands by the input. 
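
The “forget, then expand” reading of dependency-based update can be sketched as follows, together with the world-by-world definition so that the two can be compared on a toy case. Formulas are Python predicates on worlds, the set Dep(α) is passed in explicitly rather than computed, and the two-variable example (K = a ∧ b updated by ¬a) is hypothetical.

```python
from itertools import product


def worlds(variables):
    """Enumerate every truth assignment over the given variables."""
    for bits in product([False, True], repeat=len(variables)):
        yield dict(zip(variables, bits))


def forget(phi, xs):
    """Forget(phi, X): for each x in X, take phi[x<-1] or phi[x<-0]."""
    result = phi
    for x in xs:
        def step(w, prev=result, x=x):
            return prev({**w, x: True}) or prev({**w, x: False})
        result = step
    return result


def update_formula(k, alpha, dep):
    """Forget(K, Dep(alpha)) and alpha, with Dep(alpha) supplied as `dep`."""
    forgotten = forget(k, dep)
    return lambda w: forgotten(w) and alpha(w)


def update_world(w, alpha, dep, variables):
    """Worlds satisfying alpha that agree with w outside Dep(alpha)."""
    kept = [x for x in variables if x not in dep]
    return [
        v for v in worlds(variables)
        if alpha(v) and all(v[x] == w[x] for x in kept)
    ]


VARS = ("a", "b")
K = lambda w: w["a"] and w["b"]
ALPHA = lambda w: not w["a"]
DEP = {"a"}  # assumed Dep(alpha) for this example

updated = update_formula(K, ALPHA, DEP)
models = [w for w in worlds(VARS) if updated(w)]
```

Only the world where a is false and b is true survives: the update changes a and leaves b, which is not relevant to the input, untouched.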

3 Updating OCFs 

We are now going to apply the principle “forget, then expand”, at work in 
dependency-based update, to epistemic states consisting of OCFs. Therefore 
what we have to do first is to generalize variable forgetting to OCFs. 

3.1 Independence of an OCF from a set of variables 

Recall that variable forgetting can be characterized by the following result: 
$Forget(\varphi, X)$ is the strongest consequence of $\varphi$ that is independent from $X$. 
We may thus define variable forgetting from an OCF by a similar construction, 
which requires first to define independence of an OCF from a set of variables. 

Definition 5 Let $X \subseteq VAR$. 

- An OCF $r$ is independent from $X$ if and only if there is a SBB $B$ inducing 
$r$ not mentioning any variable from $X$. 

- A SBB $B$ is independent from $X$ if and only if its generated OCF $r_B$ is 
independent from $X$. 

Example 2: $B = \langle B_1, B_2, B_\infty \rangle$ with $B_\infty = a \to b$, $B_2 = (a \to \neg b) \wedge (a \to 
b \vee c) \wedge (b \to d)$, $B_1 = b$. The OCF induced by $B$ is the following: 

$r(\omega) = \infty$ for each $\omega \models a \wedge \neg b$ 
$r(\omega) = 2$ for each $\omega \models b \wedge \neg d$ 
$r(\omega) = 1$ for each $\omega \models \neg a \wedge \neg b$ 
$r(\omega) = 0$ for each $\omega \models \neg a \wedge b \wedge d$ 



The following simple result states that it is sufficient to focus on independence 
from a single variable. 

Proposition 1 $r$ (resp. $B$) is independent from $X$ if and only if $r$ (resp. $B$) is
independent from $\{x\}$ for all $x \in X$.

Therefore, all the information about the sets of variables an OCF $r$ is dependent
on can be summarized by the set of variables $DepVar(r) = \{x \in VAR \mid r$ depends on $\{x\}\}$
(and we define $DepVar(B)$ similarly).

It is not difficult to verify that the SBB $B' = (B'_1, B'_2, B'_\infty)$ with $B'_1 = B_1$,
$B'_2 = (a \rightarrow \neg b) \wedge (b \rightarrow d)$ and $B'_\infty = B_\infty$, induces the same OCF, i.e., $r_B = r_{B'}$.
Therefore, $B$ and $B'$ are equivalent, and since $c$ is not mentioned in $B'$, $B$ is
independent from $\{c\}$ (and so is the OCF $r_B$). On the other hand, it is dependent
on $\{a\}$, $\{b\}$ and $\{d\}$, i.e., $DepVar(B) = \{a, b, d\}$.

The following result gives semantical characterizations of independence of an 
OCF from a variable. 

Proposition 2 Let $r$ be an OCF and $X \subseteq VAR$. The following four statements
are equivalent.

1. $r$ is independent from $X$.

2. For any $X$-worlds $\omega_X$, $\omega'_X$ we have $r(\omega_{VAR \setminus X} \cdot \omega_X) = r(\omega_{VAR \setminus X} \cdot \omega'_X)$.

3. For any variable $x \in X$ and any $\omega \in 2^{VAR}$, we have $r(\omega) = r(switch(\omega, x))$.

4. For any nontautological $\varphi$ such that $Var(\varphi) \subseteq X$, we have $E_r(\varphi) = 0$.

In the case where the OCF is defined implicitly by a SBB, the next result
gives a practical way of computing whether it is independent from $\{x\}$ without
having to write $r$ explicitly.

Proposition 3 

The SBB $B$ is independent from $X$ if and only if for all $i \in \{1, \ldots, n, \infty\}$,
$Cut(B, i)$ is independent from $X$.

Therefore, the problem of checking independence of a SBB from a variable
can be reduced to a linear number of “classical” independence problems. This
result enables us to draw generalizations of several results about formula-variable
independence stated in [16]. In particular, determining whether $B$ is independent
from $X$ is coNP-complete.

3.2 Forgetting in OCFs 

Definition 6 Let $X \subseteq VAR$ and $r$ be an OCF. $Forget(r, X)$ is the minimal
OCF $r'$ (w.r.t. <) such that $r'$ > $r$ and $r'$ is independent from $X$.

The following result gives a semantical characterization of forgetting. 

Proposition 4 Let $r$ be an OCF and $X \subseteq VAR$. Then

$$Forget(r, X)(\omega) = \min\{r(\omega') \mid \omega' \in 2^{VAR} \text{ and } \omega'_{VAR \setminus X} = \omega_{VAR \setminus X}\}$$

Note that when $X$ is a singleton $\{x\}$, the latter identity becomes
$Forget(r, \{x\})(\omega) = \min(r(\omega), r(switch(\omega, x)))$. The previous definition and
characterization are not operational when the OCF is represented implicitly
under the form of a SBB. The next result tells us how to implement variable
forgetting from a SBB in practice, namely by forgetting from the $n$ classical
propositional formulas $Cut(B, i)$.
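The semantical characterization of Proposition 4 can be sketched directly on an explicitly represented OCF. The Python fragment below is a minimal illustration, not the SBB-based procedure: it assumes that an OCF is stored as a dictionary whose keys are worlds (each world being the frozenset of variables it makes true) and whose values are ranks. The names `forget` and `is_independent` and the sample ranks are hypothetical. The independence test relies on the observation, derived from statement 2 of Proposition 2, that an OCF is independent from X exactly when forgetting X leaves it unchanged.

```python
def forget(ocf, xs):
    """Forget(r, X)(w): minimum rank among the worlds that agree with w outside X."""
    xs = frozenset(xs)
    return {w: min(rank for v, rank in ocf.items() if v - xs == w - xs)
            for w in ocf}

def is_independent(ocf, xs):
    """An OCF is independent from X when forgetting X does not change any rank."""
    return forget(ocf, xs) == ocf

# Illustrative ranks over the variables a and b (made-up values).
ranks = {
    frozenset(): 1,
    frozenset({"a"}): 2,
    frozenset({"b"}): 0,
    frozenset({"a", "b"}): 0,
}
forgotten = forget(ranks, {"a"})
```

Here the worlds `{}` and `{a}` differ only on `a`, so both receive the rank min(1, 2) = 1 after forgetting, which is the singleton case stated after Proposition 4.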
Proposition 5 Let $B$ be a SBB and $X \subseteq VAR$. Let

$BForget(B, X) = (Forget(Cut(B, i), X))_{i = 1, 2, \ldots, n, \infty}$. Then we have

$$Forget(r_B, X) = r_{BForget(B, X)}$$

Example 3: Let $B = (B_1, B_2, B_\infty)$ with $B_\infty = \top$, $B_2 = a \wedge c$, $B_1 = a \rightarrow b$.
We have $BForget(B, \{a\}) = (b \wedge c, c, \top)$; $BForget(B, \{b\}) = (a \wedge c, a \wedge c, \top)$;
$BForget(B, \{a, c\}) = (b, \top, \top)$.

Note that it is important to take the conjunction of the strata before forgetting:
since $Forget(B_2, \{a\}) = c$ and $Forget(B_1, \{a\}) = \top$, $BForget(B, \{a\})$ is not
equivalent to $(Forget(B_1, \{a\}), Forget(B_2, \{a\}), Forget(B_\infty, \{a\}))$.



3.3 Updating an epistemic state by a formula and a rank

Recall that a transmutation operator maps an OCF $r$, a consistent formula
$\varphi$ and a rank $i$ to a new OCF $r^{*}(\varphi, i)$ such that $E_{r^{*}(\varphi, i)}(\varphi) = i$ (see [21]). On

this ground, we define the update of $r$ by $\alpha$ with rank $i$ as the transmutation
of $Forget(r, Dep(\alpha))$ by the new belief $\alpha$ together with its OCF degree $i$. This
supposes that a transmutation operator has been previously fixed.

Definition 7 (U-transmutation)

Let $*$ be a transmutation operator, $Dep$ a dependency function, $r$ an OCF, $\alpha$ a
consistent, nontautological formula and $i$ a rank. The U-transmutation of $r$ by
$(\alpha, i)$ with respect to $Dep$ and $*$ is defined by

$$r^{\diamond}_{*}(\alpha, i) = Forget(r, Dep(\alpha))^{*}(\alpha, i)$$

After the forgetting process has pushed $E_{Forget(r, Dep(\alpha))}(\alpha)$ down to 0 (see
last point of Proposition 2), the transmutation process pushes it up to the specified
level $i$, i.e., enforces $E_{r^{\diamond}_{*}(\alpha, i)}(\alpha) = i$. Importantly, note that the higher $i$,
the less entrenched $\alpha$ and the more entrenched $\neg\alpha$. Hence, when learning a new
fact $\varphi$ with some entrenchment degree $i$ reflecting the evolution of the world,
the initial knowledge base has to be U-transmuted by $(\neg\varphi, i)$. The higher $i$, the
more entrenched the new information $\varphi$ and the more unlikely the most plausible
$\neg\varphi$-worlds. The limit case of updating by a certain input $\varphi$ consists in
U-transmuting by $(\neg\varphi, \infty)$, which enforces $r^{\diamond}_{*}(\neg\varphi, \infty)(\omega) = \infty$ for all models $\omega$ of
$\neg\varphi$, i.e., $E_{r^{\diamond}_{*}(\neg\varphi, \infty)}(\neg\varphi) = \infty$.

We consider now two of the most common transmutation schemes, namely
conditionalization [20] and adjustment [21]. The following expressions can be
derived from the above definition, the general formulations of conditionalization
and adjustment (omitted for the sake of brevity), and the fact that for any consistent,
nontautological formula $\alpha$, $E_{Forget(r, Dep(\alpha))}(\alpha) = E_{Forget(r, Dep(\alpha))}(\neg\alpha) = 0$
(last point of Proposition 2):

$*$ = conditionalization [20]

$$r^{\diamond}_{*}(\alpha, i)(\omega) = \begin{cases} Forget(r, Dep(\alpha))(\omega) & \text{if } \omega \models \neg\alpha \\ Forget(r, Dep(\alpha))(\omega) + i & \text{if } \omega \models \alpha \end{cases}$$

$*$ = adjustment [21]

$$r^{\diamond}_{*}(\alpha, i)(\omega) = \begin{cases} Forget(r, Dep(\alpha))(\omega) & \text{if } \omega \models \neg\alpha \\ \max(i, Forget(r, Dep(\alpha))(\omega)) & \text{if } \omega \models \alpha \end{cases}$$

Two limit cases are worth considering:

1. When $i = \infty$ - meaning, as said above, that the information $\neg\alpha$ is certain in
the new state of affairs - then $r^{\diamond}_{*}(\alpha, \infty)$ is independent from the choice for $*$:

$$r^{\diamond}_{*}(\alpha, \infty)(\omega) = \begin{cases} Forget(r, Dep(\alpha))(\omega) & \text{if } \omega \models \neg\alpha \\ \infty & \text{if } \omega \models \alpha \end{cases}$$

2. When $i = 0$, the transmutation step (whatever the choice of $*$) has no effect
on $Forget(r, Dep(\alpha))$ since $E_{Forget(r, Dep(\alpha))}(\alpha) = 0$. This merely means that everything
about the variables concerned with $\alpha$ has been forgotten. Note that, as a
consequence, $r^{\diamond}_{*}(\alpha, 0)$ and $r^{\diamond}_{*}(\neg\alpha, 0)$ coincide and are equal to $Forget(r, Dep(\alpha))$.
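The two schemes and their limit cases can be sketched on an explicit OCF. The Python fragment below is a hedged illustration only: it assumes worlds are frozensets of true variables, it reads the conditionalization case as “add i to the rank of every model of the formula” and the adjustment case as “raise the rank of every model of the formula to at least i” (the latter being an assumption about a formula that is hard to read in the extracted text), and the names `forget` and `u_transmute` are hypothetical. The sample OCF is assumed to be the one induced by a base stating that the door is normally open and that the door or the window is certainly open.

```python
INF = float("inf")

def forget(ocf, xs):
    """Minimum rank among the worlds that agree with w outside xs."""
    xs = frozenset(xs)
    return {w: min(rank for v, rank in ocf.items() if v - xs == w - xs)
            for w in ocf}

def u_transmute(ocf, alpha, dep, i, scheme="conditionalization"):
    """Forget the variables in dep, then push the models of alpha up to rank i."""
    result = {}
    for world, rank in forget(ocf, dep).items():
        if not alpha(world):
            result[world] = rank              # models of "not alpha" are untouched
        elif scheme == "conditionalization":
            result[world] = rank + i          # shift every alpha-world by i
        elif scheme == "adjustment":
            result[world] = max(i, rank)      # assumed reading of the adjustment case
        else:
            raise ValueError("unknown scheme: " + scheme)
    return result

# Assumed initial OCF: rank 0 if the door is open, 1 if only the window is open,
# infinity if both are closed.
r = {
    frozenset({"door", "window"}): 0,
    frozenset({"door"}): 0,
    frozenset({"window"}): 1,
    frozenset(): INF,
}
closed_door = u_transmute(r, lambda w: "door" in w, {"door"}, INF)
closed_window = u_transmute(r, lambda w: "window" in w, {"window"}, INF)
unknown_window = u_transmute(r, lambda w: "window" in w, {"window"}, 0)
```

With rank infinity both schemes give the same result, and with rank 0 the result is just the forgotten OCF, as stated in the two limit cases.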



Now, when the initial OCF $r$ is given implicitly under the form of a SBB, its
U-transmutation by $(\alpha, i)$ can be computed without generating $r$ explicitly, in
both particular cases where $*$ is a conditionalization and an adjustment.

Proposition 6 Let $B$ be a consistent SBB. Let $r$ be an OCF, $\alpha$ a consistent,
nontautological formula and $i$ a rank.

1. if $*$ = conditionalization then $r^{\diamond}_{*}(\alpha, i) = r_{Add(BForget(B, Dep(\alpha)), \neg\alpha, i)}$;

2. if $*$ = adjustment and $i \neq \infty$ then $r^{\diamond}_{*}(\alpha, i) = r_{ShiftAdd(BForget(B, Dep(\alpha)), \neg\alpha, i)}$
where $ShiftAdd(K, \alpha, i) = (K_1 \vee \neg\alpha, \ldots, K_{i-1} \vee \neg\alpha, K_i \wedge \alpha, (K_{i+1} \vee \neg\alpha) \wedge (K_1 \vee \alpha),
\ldots, (K_n \vee \neg\alpha) \wedge (K_{n-i} \vee \alpha), K_{n-i+1} \vee \alpha, \ldots, K_n \vee \alpha, K_\infty)$.

Example 4 (Door and window)

Suppose that initially, the agent knows for sure that the door is open or the window
is open, and that normally the door is open. Thus, the initial epistemic state
$r_B$ is induced by the SBB

$B = (B_1 =$ door-open, $B_\infty =$ door-open $\vee$ window-open$)$

Closing the door amounts to updating the epistemic state by the certain piece of
information $\neg$door-open. We get: (i) $DepVar(\neg$door-open$) = \{$door-open$\}$;
(ii) $BForget(B, \{$door-open$\}) = (\top, \top)$; (iii) $r_B^{\diamond}($door-open, $\infty)$ is the OCF
induced by the SBB $(\top, \neg$door-open$)$. Now, closing the window amounts to
updating the epistemic state by the certain piece of information $\neg$window-open,
i.e., to U-transmuting it by (window-open, $\infty$): (i) $DepVar(\neg$window-open$) =
\{$window-open$\}$; (ii) $BForget(B, \{$window-open$\}) = ($door-open, $\top)$;

(iii) $r_B^{\diamond}($window-open, $\infty)$ is associated to the OCF induced by the SBB
$(\neg$window-open, door-open$)$. Note that whereas we do not know anything more
about the window after we closed the door, we still know that the door is normally
open after we closed the window - which is intended.

Let us now consider the action “do something with the window” which results
nondeterministically in the window being closed or open, none of these results being
exceptional. We U-transmute the initial belief base by (window-open, 0) (note
that, obviously, it would work as well with ($\neg$window-open, 0)): $r_B^{\diamond}($window-open, 0)
is the OCF induced by the SBB (door-open, $\top$).



3.4 Application to reasoning about action

When reasoning about action, the formula representing the knowledge about 
the initial state of the world is updated by the explicit changes caused by the 
actions. Now, it is often the case that the possible results of a nondeterministic 
action do not all have the same plausibility. Rather, typical nondeterministic 
actions have, for a given initial state, one or several normal effects, plus one 
or several exceptional effects, with possibly different levels of exceptionality. In 
this case, one has to update the initial belief base by a SBB rather than with a 
single formula. For the purpose of applying U-transmutations to reasoning about 
action, we extend U-transmutations to the case where some of the variables are 
not allowed to be forgotten, because they are static. We first need to partition the
set of variables into static and dynamic variables, i.e., $VAR = SVAR \cup DVAR$.
Static variables are persistent, i.e., their truth value does not evolve. Such a
distinction is meant to forget only dynamic variables relevant to the update
(static variables should not be forgotten). These static and dynamic variables
may depend on the action performed and be specified together with the action
description (see [12]). Note that the standard case is recovered when $SVAR = \emptyset$.

Definition 8 The U-transmutation of $B$ by $(\alpha, i)$, w.r.t. the static variables
$SVAR$, a dependency relation $Dep$ and a given transmutation $*$ is defined by
$r^{\diamond}_{*}(\alpha, i) = Forget(r, Dep(\alpha) \setminus SVAR)^{*}(\alpha, i)$.

Example 5 (Saturday night shooting).

Let us consider the problem mentioned in the introduction. Let the initial epistemic
state $r_B$ be represented by the SBB

$B = (B_1 =$ alive $\wedge \neg$hidden, $B_\infty = \neg$drunk$)$

Furthermore, drunk is a static variable: $SVAR = \{$drunk$\}$, which means that
none of the actions considered in the action model can influence the truth value
of drunk. Updating by the result of the action shoot, namely
$\varphi = ($drunk $\rightarrow$ hidden$) \wedge (\neg$drunk $\rightarrow \neg$alive$)$
gives the following result: $r_B^{\diamond}(\neg\varphi, \infty)$ is the OCF induced by the SBB

$(\neg$drunk, $($drunk $\rightarrow$ hidden$) \wedge (\neg$drunk $\rightarrow \neg$alive$))$
which is equivalent to (i.e., induces the same OCF as) this other SBB:

$(\neg$drunk $\wedge \neg$alive, $($drunk $\rightarrow$ hidden$) \wedge (\neg$drunk $\rightarrow \neg$alive$))$.

Thus, in the final belief state, it is believed (yet with no certainty) that the turkey
is dead, which is intended.
4 Related work

Changing epistemic states has been considered many times for belief revision, 
especially when it comes to iteration. In particular, the recent work of Benferhat, 
Konieczny, Papini and Pino-Perez [1] investigates the revision of an epistemic 
state by an epistemic state. 

As to belief update, the closest approach to ours is Boutilier’s generalized 
update [3]. Generalized update is more general than both belief revision and 
belief update. It models epistemic states by OCFs. A generalized update oper- 
ation considers (i) the (explicit) description of the initial epistemic state; (ii) 
the dynamics of a given set of events (each of which has its own plausibility
rank) expressed by a collection of transition functions mapping an initial and a
final world to a rank; (iii) a formula representing an observation made after the
evolution of the dynamic system; now, the output consists of the identification
of the events that most likely occurred, a revised initial belief state and an updated
new belief state. In the absence of observations (i.e., when updating by
$\top$), generalized update merely computes the most likely evolution of the system
from its dynamics and the initial belief state, which is not far from the goals of
our approach. The crucial difference is in the way this most likely evolution is
computed: in [3] epistemic states are represented explicitly (by fully specified ordinal
conditional functions), while in our approach the dynamics of the system
is represented in a very compact way: requiring that fluents dependent (resp. 
independent) of the input formula be forgotten (resp. remain unchanged) is a 
compact way to encode the dynamics of the system - it is a kind of a solution 
to the frame problem. In further work we plan to integrate observations (as in 
generalized update) in our model, and thus develop an efficient way, based on 
dependence relations, of performing generalized update. 

Another related line of work is [7] who show that Lewis’ imaging operations 
can be viewed as belief updates on belief states consisting of probability dis- 
tributions. They propose a counterpart of imaging to possibility theory. Both 
classes of operations map a belief state and a flat formula to a belief state, and 
they are based on minimization. 
Fast Text Classification Using Sequential
Sampling Processes
Abstract. A central problem in information retrieval is the automated
classification of text documents. While many existing methods achieve
good levels of performance, they generally require levels of computation
that prevent them from making sufficiently fast decisions in some applied
settings. Using insights gained from examining the way humans make fast
decisions when classifying text documents, two new text classification al- 
gorithms are developed based on sequential sampling processes. These 
algorithms make extremely fast decisions, because they need to examine 
only a small number of words in each text document. Evaluation against 
the Reuters-21578 collection shows both techniques have levels of per- 
formance that approach benchmark methods, and the ability of one of 
the classifiers to produce realistic measures of confidence in its decisions 
is shown to be useful for prioritizing relevant documents. 



1 Introduction 

A central problem in information retrieval is the automated classification of 
text documents. Given a set of documents, and a set of topics, the classification 
problem is to determine whether or not each document is about each topic. This 
paper presents two fast text document classifiers inspired by the human ability 
to make quick and accurate decisions by skimming text documents. 



2 Existing Methods 

A range of artificial intelligence and machine learning techniques have been ap- 
plied to the text classification problem. A recent and thorough evaluation of five 
of the best-performing methods is provided in [1]. The classifiers examined are:

- Support Vector Machines (SVM), which use a training set to find optimal
hyperplanes that separate documents into those about a topic, and those not
about a topic. These hyperplanes are then applied to classify new documents.

- k-Nearest Neighbor classifiers (kNN), which classify new documents according
to the known classifications of their nearest training set neighbors.

- Linear Least Squares Fit classifiers (LLSF), which generate a multivariate
regression model from a training set that can be applied to new documents.

- Neural Network classifiers (NNet), which learn the connection weights within
a 3-layer neural network using a training set, and then apply this network
to classify new documents.

- Naive Bayes classifiers (NB), which use a training set to estimate the probabilities
of words indicating documents being about topics, and use a simple
version of Bayes' theorem with these probabilities to classify new documents.

Different performance measures show different levels of relative performance 
for the five classifiers, although the SVM and kNN are generally the most ef- 
fective, followed by the LLSF, with the NNet and NB classifiers being the least 
effective [1]. What is important, from an applied perspective, is the considerable 
degree of computation undertaken by each classifier, either during the training 
process, the process of classifying new documents, or both. SVMs, for example, 
require the solution to a quadratic programming problem during training, LLSF 
classifiers must solve a large least-squares problem, and NNets are notoriously 
time consuming to train. 

In classifying new documents, most existing techniques consider every word 
in the document, and often have to calculate involved functions. This means that 
they take time to process large corpora. In many applied situations, analysts re- 
quire fast ‘on-line’ text document classification, and would be willing to sacrifice 
some accuracy for the sake of timeliness. The aim of this paper is to develop 
text classifiers that emphasize speed rather than accuracy, and so the results in 
[1] are used as guides on acceptable performance, rather than benchmarks to be 
exceeded. 

3 Some Insights from Psychology 

As with many artificial intelligence and machine learning problems, there is 
much to be learned from examining the way in which humans perform the task 
of text classification. In particular, it is worth making the effort to understand 
how people manage to make quick and accurate decisions regarding which of the 
many text documents they encounter every day (e.g., newspaper articles) are 
about topics of interest. 

3.1 Bayesian Decision Making 

A first psychological insight involves the relationship between the decisions “this
document is about this topic” and “this document is not about this topic”. When
people are asked to make this decision, they actively seek information that would
help them make either choice. They do not look only for confirming information
in the hope of establishing that the document is about the topic, and conclude
otherwise if they fail to find enough information.

For example, if people are asked whether a newspaper article is about the
US Presidential Elections, consider the following three scenarios:

- The first word is “The”. In this case, most people would not be able to make
any decision with any degree of confidence.

- The first word is “Cricket”. In this case, most people would confidently
respond ‘No’.

- The first word is “Gore”. In this case, most people would confidently respond
‘Yes’.



The fact that people are able to decide to answer ‘No’ in the second scenario 
suggests that they are actively evaluating the word as evidence in favor of the 
document not being about the topic (in the same way they actively evaluate 
the word “Gore” in the third scenario). This behavior suggests that people treat
the choices “this document is about this topic” and “this document is not about this
topic” as two competing models, and are able to use the content of the document,
in a Bayesian way, as evidence in favor of either model. Many established
text classifiers, including the kNN, LLSF and NNet classifiers, do not operate 
this way. In general terms, these classifiers construct a measure of the similarity 
between the document in question, and some abstract representation of the topic 
in question. When the measure of similarity exceeds some criterion value, the 
decision is made that the document is about the topic, otherwise the default 
decision is made that the document is not about the topic. The text classifiers 
developed here, however, actively assess whether the available information allows
the decision “this document is not about this topic” to be made. Adopting
this approach dramatically speeds text classification, because it is often possible 
to determine that a document is not about a topic directly, rather than having 
to infer this indirectly from failing to establish that it is about the topic. 

At the heart of the Bayesian approach are measures of the evidence individual 
words provide for documents either being about a topic, or not being about a 
topic. The evidence that the $i$-th word in a dictionary provides about topic $T$,
denoted by $V_T(w_i)$, may be calculated on a log-odds scale as follows:

$$V_T(w_i) = \ln \frac{p(w_i \mid T)}{p(w_i \mid \bar{T})} = \ln \frac{|w_i \in T| \,/\, |T|}{|w_i \in \bar{T}| \,/\, |\bar{T}|},$$

where $T$ is “about a topic”, $\bar{T}$ is “not about a topic”, $|w_i \in T|$ is the number of
times word $w_i$ occurs in documents about topic $T$, and $|T|$ is the total number
of words in documents about topic $T$. Note that these evidence values are symmetric
about zero: Words with positive values (e.g., “Gore”) suggest that the
document is about the topic, words with negative values (e.g., “cricket”) suggest
that the document is not about the topic, and words with values near zero (e.g.,
“the”) provide little evidence for either alternative.
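The log-odds evidence values can be estimated from word counts along the following lines. This Python sketch is an illustration under stated assumptions rather than the procedure used in the paper: documents are lists of lower-case tokens, the function name `word_evidence` is hypothetical, and a small additive smoothing constant is introduced (it is not mentioned in the text) so that a word absent from one class does not produce a logarithm of zero. The toy documents are invented.

```python
import math
from collections import Counter

def word_evidence(about_docs, other_docs, smoothing=1.0):
    """Log-odds evidence per word: ln of p(word | about) over p(word | not about)."""
    about = Counter(word for doc in about_docs for word in doc)
    other = Counter(word for doc in other_docs for word in doc)
    vocabulary = set(about) | set(other)
    total_about = sum(about.values())
    total_other = sum(other.values())
    evidence = {}
    for word in vocabulary:
        # Additive smoothing is an assumption made here to avoid zero probabilities.
        p_about = (about[word] + smoothing) / (total_about + smoothing * len(vocabulary))
        p_other = (other[word] + smoothing) / (total_other + smoothing * len(vocabulary))
        evidence[word] = math.log(p_about / p_other)
    return evidence

# Invented toy training documents.
about_docs = [["gore", "wins", "the", "vote"], ["the", "election", "gore"]]
other_docs = [["cricket", "the", "team"], ["the", "cricket", "match"]]
evidence = word_evidence(about_docs, other_docs)
```

On this toy data “gore” receives positive evidence, “cricket” negative evidence, and “the” a value close to zero, mirroring the three cases described above.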
3.2 Non-compensatory Decision Making

A second psychological insight is that, when people decide whether or not a 
text document is about a topic, they often make non-compensatory decisions. 
This means that people are able to make a decision without considering all of 
the content of a document. For example, if asked whether a newspaper article 
is about the US Presidential Elections, and the first 11 words read are “Alan 
Border yesterday questioned the composition of the Australian cricket team 
...”, most people would choose to answer ‘No’, even if they were permitted to 
examine the remainder of the article.

Fig. 1. The mean absolute evidence provided by words in the Reuters-21578 Corpus,
as a function of their relative position in the document.

In making non-compensatory decisions,
people rely on regularities in their environment [2]. In the case of text documents,
they assume that words near the beginning will provide some clear indication of 
the semantic topic. This assumption is borne out by the analysis of the entire 
Reuters-21578 collection presented in Figure 1, which shows the mean absolute 
evidence provided by words according to their relative position in the documents. 
Words at the beginning of documents provide relatively more evidence than those 
in the middle or near the end, although there is a small increase for words at 
the very end, presumably associated with the ‘summing up’ of documents. The 
important point, for the purposes of fast text classification, is that it is possible 
to know a priori those words in a document that will be the most useful for 
making a decision. Figure 1 suggests that, at least for news-style documents, 
they will be words at or near the beginning of the document. 
3.3 Complete Decision Making

A final psychological insight is that when people decide whether or not a text 
document is about a topic, they undertake a decision process that generates 
more information than just a binary choice. People give answers with a certain 
level of accuracy, having taken a certain period of time, and are able to express 
a certain level of confidence in their decision. An automatic text classification 
system capable of providing the same sort of response outputs seems likely to 
have advantages in many applied situations. 



4 Sequential Sampling Process Models 

Within cognitive psychology, the most comprehensive accounts of human deci- 
sion making are provided by sequential sampling models. In particular, a number 
of ‘random-walk’ and ‘accumulator’ models have been developed, and demon- 
strated to be successful in a variety of experimental situations [3,4]. These models 
are based on the notion of accruing information through the repeated sampling 
of a stimulus, until a threshold level of information in favor of one alternative has
been collected to prompt a decision. 

Both random walk and accumulator models naturally capture the three 
psychological insights into the text classification problem. Both models use a 
Bayesian approach to model selection, in the sense that they establish explicit 
thresholds for both of the possible decisions. The use of thresholds also means 
that non-compensatory decisions are made, since the stimulus is only examined 
until the point where the threshold is exceeded. Furthermore, by examining the 
words in a text document in the order that they appear in the document, those 
words that are more likely to enable a decision to be made will tend to be pro- 
cessed first. Finally, both models are able to generate measures of confidence in 
their decisions. 

This integration of the psychological insights suggests text classifiers that 
operate by examining each word in a text document sequentially, evaluating 
the extent to which that word favors the alternative decisions “this document 
is about the topic” and “this document is not about the topic”, and using the 
evidence value to update the state of a random-walk or accumulator model. 

4.1 Random Walk Text Classifier 

In random walk models, the total evidence is calculated as the difference between 
the evidence for the two competing alternatives, and a decision is made once it 
reaches an upper or lower threshold. This process can be interpreted in Bayesian 
terms [5], where the state of the random walk is the log posterior odds of the 
document being about the topic. Using Bayes’ theorem, the log posterior odds
are given by

$$\ln \frac{p(T \mid D)}{p(\bar{T} \mid D)} = \ln \frac{p(T)\, p(D \mid T)}{p(\bar{T})\, p(D \mid \bar{T})},$$

where $D$ is the document being classified in terms of topic $T$. Assuming the
document is appropriately represented in terms of its $n$ words $w_1, w_2, \ldots, w_n$,
which is probably the most justifiable assumption, although it is certainly not
the only possibility, this becomes

$$\ln \frac{p(T \mid D)}{p(\bar{T} \mid D)} = \ln \left[ \frac{p(T)}{p(\bar{T})} \cdot \frac{p(w_1, w_2, \ldots, w_n \mid T)}{p(w_1, w_2, \ldots, w_n \mid \bar{T})} \right].$$



If it is further assumed that each word provides independent evidence, which is
more problematic, but is likely to be a reasonable first-order approximation, the
log posterior odds becomes

$$\ln \frac{p(T \mid D)}{p(\bar{T} \mid D)} = \ln \frac{p(T)}{p(\bar{T})} + \ln \frac{p(w_1 \mid T)}{p(w_1 \mid \bar{T})} + \ln \frac{p(w_2 \mid T)}{p(w_2 \mid \bar{T})} + \ldots + \ln \frac{p(w_n \mid T)}{p(w_n \mid \bar{T})}$$

$$= \ln \frac{p(T)}{p(\bar{T})} + V_T(w_1) + V_T(w_2) + \ldots + V_T(w_n).$$



This final formulation consists of a first ‘bias’ term, given by the log prior odds, 
that determines the starting point of the random walk, followed by the summa- 
tion of the evidence provided by each successive word in the document. 

Once a random walk has terminated, and a decision made according to 
whether it reached an upper or lower threshold, a measure of confidence in 
the decision can be obtained as an inverse function of the number of words ex- 
amined. For documents that require many words to classify, confidence will be 
low, while for documents classified quickly using few words, confidence will be 
high. 

Figure 2 summarizes the operation of the random walk classifier on a document
from the Reuters-21578 collection that is about the topic being examined.
The state of the random walk is shown as the evidence provided by successive
words in the document is assessed. A threshold value of 50 is shown by the
dotted lines above and below. This example highlights the potential of non-compensatory
decision making, because the first 100 words of the document
allow the correct decision to be made, but the final state of the random walk,
when the entire document has been considered, does not lead to the correct
decision being made.
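The random walk classifier described in this section can be sketched as follows. This is a minimal Python illustration under stated assumptions: the evidence values are supplied as a dictionary, unknown words are treated as contributing zero evidence, the confidence is taken to be 1/n for n words examined (the text only says “an inverse function”), and a forced choice on the sign of the final state is made when no threshold is reached. The function name `random_walk_classify` and the evidence values are hypothetical.

```python
def random_walk_classify(words, evidence, threshold, log_prior_odds=0.0):
    """Return (is_about_topic, words_examined, confidence) for one document."""
    state = log_prior_odds                  # the bias term sets the starting point
    for n, word in enumerate(words, start=1):
        state += evidence.get(word, 0.0)    # unknown words add no evidence (assumption)
        if state >= threshold:
            return True, n, 1.0 / n         # 1/n is one possible inverse function
        if state <= -threshold:
            return False, n, 1.0 / n
    # Forced choice when neither threshold was reached: use the sign of the final state.
    examined = len(words)
    return state > 0, examined, (1.0 / examined if examined else 0.0)

# Made-up evidence values for illustration.
evidence = {"gore": 2.0, "cricket": -3.0, "the": 0.0}
yes_result = random_walk_classify(["the", "gore", "gore", "cricket"], evidence, 3.0)
no_result = random_walk_classify(["cricket", "gore"], evidence, 3.0)
```

In the first call the walk reaches the upper threshold after three words, so the trailing “cricket” is never read; this is the non-compensatory behavior the section describes.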



4.2 Accumulator Text Classifier 

The accumulator text classifier is very similar to the random walk version, except
that separate evidence totals are maintained, and a decision is made when
either one of them reaches a threshold. This means that evidence provided by
each successive word $V_T(w_i)$ is added to the “is about topic” accumulator $A_T$
if it is positive, and to the “is not about topic” accumulator $A_{\bar{T}}$ if it is negative.
Once either $A_T$ reaches a positive threshold, or $A_{\bar{T}}$ reaches a negative threshold,
the corresponding decision is made. The confidence in this decision is then
measured according to the difference between the evidence totals accumulated
for each decision, as a proportion of the total evidence accumulated, as follows:
$(A_T - |A_{\bar{T}}|) / (A_T + |A_{\bar{T}}|)$.
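A corresponding sketch of the accumulator classifier is given below. It is an illustration only, assuming a dictionary of evidence values, zero evidence for unknown words, and a forced choice on the larger accumulated total when no threshold is reached; the function name `accumulator_classify` and the evidence values are hypothetical. The confidence is computed with the signed ratio stated above, so it is negative when the evidence favors “not about the topic”.

```python
def accumulator_classify(words, evidence, threshold):
    """Return (is_about_topic, words_examined, confidence) for one document."""
    about_total = 0.0        # "is about topic" accumulator, grows upward
    not_about_total = 0.0    # "is not about topic" accumulator, grows downward
    for n, word in enumerate(words, start=1):
        value = evidence.get(word, 0.0)     # unknown words add no evidence (assumption)
        if value > 0:
            about_total += value
        elif value < 0:
            not_about_total += value
        if about_total >= threshold:
            decision = True
            break
        if not_about_total <= -threshold:
            decision = False
            break
    else:
        # Forced choice: the larger of the two accumulated totals decides.
        decision = about_total > abs(not_about_total)
        n = len(words)
    total = about_total + abs(not_about_total)
    confidence = (about_total - abs(not_about_total)) / total if total > 0 else 0.0
    return decision, n, confidence

# Made-up evidence values for illustration.
evidence = {"gore": 2.0, "cricket": -3.0, "the": 0.0}
decision, examined, confidence = accumulator_classify(
    ["gore", "cricket", "gore"], evidence, 4.0)
```

Unlike the random walk, the negative evidence from “cricket” does not cancel the positive total here; it only lowers the confidence of the final decision.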








Fig. 2. Operation of the random walk text classifier in a case where the document is 
about the topic in question 



5 Evaluation against Reuters-21578 

5.1 Standard Information Retrieval Measures 

The random walk and accumulator classifiers were evaluated using the ModApte 
training/test split detailed in [6] to enable comparison with the benchmark re- 
sults presented in [1]. In the interests of ensuring speed, the corpus was not 
pre-processed to the same extent as [1]. In particular, no word-stemming was
undertaken. The only pre-processing was to filter the documents into lower case
characters {a...z} together with the space character. The performance of the
text classifiers was measured in five standard ways [7]: recall, precision, macro
F1, micro F1, and error rate.

Precision, p, measures the proportion of documents the classifier decided were 
about a topic that actually were about the topic. Recall, r, measures the pro- 
portion of documents actually about a topic that were identified as such by the 
classifier. The two versions of the F1 measure, F1 = 2rp/(r + p), were obtained
by different forms of averaging. The first was obtained by ‘micro-averaging’,
where every decision made by the classifier was aggregated before calculating recall
and precision values. The second was obtained by ‘macro-averaging’, where
recall and precision values were calculated for each topic separately, and their
associated F1 values were then averaged. As argued in [1], it is important to
consider both approaches when using corpora, such as Reuters-21578, where
the distribution of topics to documents is highly skewed. The error was simply
measured as the percentage of incorrect decisions made by the classifier over all
document-topic combinations. These measures were based on modified ‘forced-choice’
versions of the random walk and accumulator classifiers, where a decision
was made even when no threshold had been reached at the end of the document.
For the random walk classifier, this decision was made on the basis of whether 
the final state was positive or negative. For the accumulator, the larger of the 
two accumulated totals was used to make a decision. 
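The difference between the two averaging schemes can be made concrete with a short Python sketch. It assumes that, for each topic, the numbers of true positives, false positives and false negatives have already been counted; the function names and the counts are hypothetical and the counts are invented purely for illustration (they are not results from the paper).

```python
def f1_score(tp, fp, fn):
    """F1 = 2rp / (r + p), with precision p and recall r computed from the counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * r * p / (r + p) if r + p else 0.0

def micro_macro_f1(counts):
    """counts maps each topic to a (tp, fp, fn) triple; returns (micro F1, macro F1)."""
    # Macro-averaging: compute F1 per topic, then average the F1 values.
    macro = sum(f1_score(*triple) for triple in counts.values()) / len(counts)
    # Micro-averaging: pool every decision first, then compute a single F1.
    tp = sum(triple[0] for triple in counts.values())
    fp = sum(triple[1] for triple in counts.values())
    fn = sum(triple[2] for triple in counts.values())
    return f1_score(tp, fp, fn), macro

# Invented counts: one frequent topic classified well, one rare topic classified poorly.
counts = {"frequent": (90, 10, 10), "rare": (1, 1, 3)}
micro, macro = micro_macro_f1(counts)
```

With a skewed topic distribution such as this one, the micro-averaged value is dominated by the frequent topic, while the macro-averaged value is pulled down by the rare one, which is why both are reported.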




Fig. 3. Precision-recall performance of the random walk and accumulator text classi- 
fiers, together with existing benchmarks. 



Figure 3 shows the precision and recall performance of the random walk and 
accumulator text classifiers for a range of different threshold values, together 
with the benchmark performances reported in [1]. While different applied set- 
tings can place different degrees of emphasis on recall and precision, the best 
balance probably lies at about the threshold value of 25. In terms of the existing 
benchmarks, both classifiers have competitive recall performance, but fall short 
in terms of precision. In practical terms, this means that the random walk and
accumulator classifiers find as many relevant documents, but return 3 or 4 irrelevant
documents in every batch of 10, whereas benchmark performance only
returns 1 irrelevant document in every batch of 10.

Figure 4 shows the micro F1 and macro F1 performance of the random walk 
and accumulator text classifiers for the threshold values up to 25, together with 
the benchmarks. On these measures, both classifiers are very competitive and, 
in fact, outperform some existing methods on the macro F1 measure. 
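
The difference between the two averaging schemes can be seen in a small sketch. The per-topic counts below are invented for illustration (one frequent topic, one rare topic) and are not results from the evaluation; the function names are likewise assumptions.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) per topic.
Counts = Tuple[int, int, int]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def micro_f1(counts: Dict[str, Counts]) -> float:
    """Pool the decisions over all topics, then compute a single F1."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)


def macro_f1(counts: Dict[str, Counts]) -> float:
    """Compute F1 for each topic separately, then average the values."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


# Hypothetical skewed corpus: a frequent topic and a rare one.
counts = {"frequent": (90, 10, 10), "rare": (1, 1, 3)}
print(round(micro_f1(counts), 3), round(macro_f1(counts), 3))
```

Because the frequent topic dominates the pooled counts, the micro-averaged value stays close to that topic's F1, while the macro-averaged value is pulled down by the rare topic. This is why both are worth reporting on a skewed corpus.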
Fig. 4. Micro and Macro F1 performance of the random walk and accumulator text 
classifiers, together with existing benchmarks. 



Table 1. Mean number of words examined, mean percentage of words examined, and 
mean percentage error of the forced choice random walk and accumulator text classi- 
fiers. 



                   Random Walk                     Accumulator 
Threshold    Words   Percentage   Error     Words   Percentage   Error 
    0         1.06       0.8%     18.5%      1.06       0.8%     18.5% 
    1         1.59       1.3%      8.1%      1.54       1.3%      8.5% 
    2         1.99       1.7%      4.7%      1.88       1.6%      5.4% 
    5         3.45       2.9%      1.8%      2.96       2.5%      2.4% 
   10         6.72       5.6%      1.1%      5.48       4.6%      1.6% 
   25         15.7      13.1%      1.0%      12.8      10.7%      1.2% 
   50         29.0      24.2%      1.0%      24.9      20.8%      1.1% 
   75         40.4      33.6%      1.0%      35.9      29.9%      1.1% 
  100         49.9      41.6%      1.0%      45.3      37.8%      1.1% 
  200         75.1      62.6%      1.0%      71.2      59.4%      1.1% 

Table 1 presents the mean number of words examined by each of the classifiers 
at each threshold, this mean count as a percentage of the average document 
length of the test set, and percentage error of the classifiers. It is interesting to 
note that the accumulator classifier generally requires fewer words than the ran- 
dom walk classifier. More importantly, these results demonstrate the speed with 
which the classifiers are able to make decisions. At a threshold of 25, only 10-13% 
of the words in a document need to be examined on average for classification 
at a 1% error rate. Given the computational complexity of existing methods, 
it seems reasonable to claim that the random walk and accumulator classifiers 
would have superior performance on any ‘performance per unit computation’ 
measure. 

5.2 Confidence and Prioritization 




Fig. 5. Confidence distributions for the forced-choice version of the accumulator clas- 
sifier. 



For the forced choice versions of the classifiers, it is informative to examine 
the distribution of confidence measures in terms of the standard signal detection 
theory classes of ‘hit’, ‘miss’, ‘correct rejection’ and ‘false alarm’. These distri- 
butions are shown at a threshold of 25 for the accumulator classifier in Figure 
5, and for the random walk classifier in Figure 6. The measures of confidence 
generated by the accumulator are meaningful, in the sense that hits and correct 
rejections generally have high confidence values, while misses and false alarms 
generally have low confidence values. The random walk confidence measures, in 
contrast, do not differ greatly for any of the four decision classes and, in fact, 
the classifier is generally more confident when it misses than when it hits. 
Fig. 6. Confidence distributions for the forced-choice version of the random walk 
classifier. 



The ability of accumulator models to provide more realistic confidence mea- 
sures than random walk models has been observed within psychology [4], and has 
practical implications for text classification. In particular, the confidence mea- 
sures can be used as ‘relevancy’ scores to order or prioritize the decisions made 
by the classifiers. The obvious way of doing this is to return all of the documents 
that were classified as being about topics first, ranked from highest confidence 
to lowest confidence, followed by the documents not classified as being about 
topics, ranked from lowest confidence to highest. 
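
The ordering just described can be sketched as follows. The tuple layout, the document identifiers, and the confidence values are hypothetical; only the ordering rule is taken from the text.

```python
from typing import List, Tuple

# (document id, classified as about the topic?, confidence in that decision)
Decision = Tuple[str, bool, float]


def prioritize(decisions: List[Decision]) -> List[str]:
    """Order documents: 'yes' decisions from most to least confident,
    followed by 'no' decisions from least to most confident."""
    positives = sorted((d for d in decisions if d[1]), key=lambda d: -d[2])
    negatives = sorted((d for d in decisions if not d[1]), key=lambda d: d[2])
    return [d[0] for d in positives + negatives]


# Hypothetical decisions for a single topic.
decisions = [
    ("a", True, 0.60),
    ("b", False, 0.90),
    ("c", True, 0.95),
    ("d", False, 0.20),
]
print(prioritize(decisions))
```

A confident rejection is placed last because it is the decision least likely to hide a relevant document, while an unconfident rejection sits directly after the accepted documents.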

This prioritization exercise was undertaken for both of the classifiers on all 
of the possible document-topic combinations, and the results are summarized by 
the ‘effort-reward’ graph shown in Figure 7. The curves indicate the proportion 
of relevant documents (i.e., the reward) found by working through a given pro- 
portion of the prioritized list (i.e., the effort). It can be seen that both classifiers 
return almost 90% of the relevant documents in the first 5% of the list, but that 
the accumulator then performs significantly better, allowing all of the relevant 
documents to be found by examining only the top 20% of the list. 
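
An effort-reward curve of this kind can be computed from any prioritized list, as in the sketch below. The relevance flags are invented for illustration and the function name is an assumption.

```python
from typing import List, Tuple


def effort_reward(ranked_relevance: List[bool]) -> List[Tuple[float, float]]:
    """For each prefix of the prioritized list, return the pair
    (proportion of list examined, proportion of relevant documents found)."""
    n = len(ranked_relevance)
    total_relevant = sum(ranked_relevance)
    found = 0
    curve = []
    for i, relevant in enumerate(ranked_relevance, start=1):
        found += relevant
        reward = found / total_relevant if total_relevant else 0.0
        curve.append((i / n, reward))
    return curve


# Hypothetical prioritized list: True marks a truly relevant document.
ranked = [True, True, False, True, False]
for effort, reward in effort_reward(ranked):
    print(f"effort={effort:.1f} reward={reward:.2f}")
```

A good prioritization pushes the curve towards the top-left corner: most of the reward is collected for very little effort.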



Fig. 7. Effort-reward performance for prioritization using the forced-choice accumulator 
and random walk classifiers. 



6 Conclusion 

This paper has presented two text classifiers based on sequential sampling models 
of human decision making. Both techniques achieve reasonable levels of performance 
in comparison to established benchmarks, while requiring minimal computational 
effort. In particular, both classifiers are capable of making extremely 
fast decisions, because they generally need to examine only a small proportion 
of the words in a document. The ability of the accumulator classifier to generate 
meaningful confidence measures has also been demonstrated to be useful in 
presenting prioritized lists of relevant text documents. 
Abstract. In order to function effectively, agents, whether human or 
software, must be able to communicate and interact through common 
understandings and compatible conceptualisations. In a multi-cultural 
world, ontological differences are a fundamental obstacle that must be 
overcome before inter-cultural communication can occur. The purpose of 
this paper is to discuss the issues faced by agents operating in large-scale 
multi-cultural environments and to argue for systems that are tolerant 
of heterogeneity, illustrating the discussion with a running example of 
researching and comparing university web sites as a realistic scenario 
representative of many current knowledge management tasks that would 
benefit from agent assistance. We then discuss the efforts of the Intelli- 
gent Agent Laboratory toward designing such tolerant systems, giving a 
detailed presentation of the results of several implementations. 



“In an ill-structured domain you cannot, by definition, have a pre-compiled 
schema in your mind for every circumstance and context you may find ... you 
must be able to flexibly select and arrange knowledge sources to most efficaciously 
pursue the needs of a given situation.” [8] 

1 The Reality of Distributed Knowledge Systems 

That useful knowledge systems inevitably incorporate vast amounts of informa- 
tion is becoming a generally acknowledged phenomenon. The evolution of the 
computer as a data processing device, and computer networks as communica- 
tion media, has provided the technical means to aggregate enormous quantities 
of information. Similarly acknowledged is that our capacity for accumulation, 
storage and reproduction of data and information has out-paced our ability to 
perceive and manipulate knowledge. This is not a new realisation; Vannevar 
Bush identified just such a glut of knowledge and information over fifty years 
ago and proposed a technological solution in the form of the memex, an enlarged 
intimate supplement to memory that anticipated the hypertext systems of today [3]. 
The need for contextualising data remains thoroughly applicable to the 
World Wide Web and other large-scale information networks. 



M. Brooks, D. Corbett, and M. Stumptner (Eds.): AI 2001, LNAI 2256, pp. 321-332, 2001. 
(c) Springer-Verlag Berlin Heidelberg 2001 




322 K. Lister and L. Sterling 



By implementing a (pseudo)global communication infrastructure that pro- 
vides means for the publication, comparison and aggregation of apparently lim- 
itless amounts of data, we have discovered the potential to ask questions as in- 
dividuals conducting our daily lives that previously would have been dismissed 
as infeasible for anyone less than a dedicated organisation. For example, with 
the entry cost of publishing a web site effectively negligible, the university that 
does not do so is the exception rather than the rule. This means that dozens, 
if not hundreds, of descriptions of courses, programs and facilities are available 
for us to peruse. As we learn this, we immediately see a possibility for compari- 
son, and want to ask reasonable and seemingly simple questions such as “Which 
faculties offer courses in applied machine vision?” or “Which campuses provide 
accommodation facilities for post-graduate students?”. 

To answer questions like these, we could fairly easily compile a list of uni- 
versity web sites; the list might even be complete. We could then visit each site 
in turn, browsing or searching and recording what information we think will 
answer our question. Finally, we could compare the results of our research from 
each site to formulate an answer. Many people perform this very task every day. 
The question that interests this paper is why our computers can’t do this for us 
yet, and how we can approach the issue of enabling them to do so. The example 
of university service descriptions is an appropriate one for the purposes of this 
paper, as the issues described can be readily seen to be present in the real world. 
Additionally, universities as institutions tend naturally to develop and often then 
actively promote their individuality; this local culture flavours their presentation 
of information that must then be reconciled with information from other insti- 
tutions that apply their own cultural characteristics to their publications. If we 
are to manage knowledge from a variety of sources effectively, we will need the 
assistance of software that is culturally aware and is capable of negotiating the 
conflicts that arise when such heterogeneous knowledge is juxtaposed. 

2 How Organisational Culture Affects Communication 

The reality of distributed information systems is an environment in which knowl- 
edge from large numbers of heterogeneous sources must be integrated in such 
a way that we can efficiently reconcile any differences in representation and 
context in order to incorporate foreign knowledge into our own world-view. To 
be able to work with knowledge from incongruous sources is becoming increas- 
ingly necessary [15] as the focus of information processing moves beyond intra- 
organisational interaction and begins to transgress borders, whether departmen- 
tal, corporate, academic or ethnic. 

Organisational cultures arise as individual organisations develop mechanisms, 
procedures and representations for dealing with the issues that they face. Because 
these cultures are generally developed in isolation, each organisation will 
inevitably arrive at different solutions to what are often very similar 
problems. In order to stream-line organisational activities and focus group efforts 
on a common goal, it is necessary for individuals to set aside their own 
personal intuited approach to a situation in favour of an agreed common understanding 
shared by the other members of the group. We do this naturally when 
we work together on a problem; some are more able than others, and we recog- 
nise teamwork and the ability to understand another’s point of view as desirable 
qualities. Such qualities are also becoming desirable in software as agents play 
an increasing role in our communication and collaboration. 

When we suppress our own intuitive understanding of a situation and at- 
tempt to adopt a standardised, agreed upon approach, we increase our ability to 
interact with others who have similarly adapted their individual understanding 
to that of the group or community. But we also lose something in the process: 
context and generality. An efficient understanding of a situation is like a model, 
in that the more closely it describes a particular situation, the less effectively it 
describes a general class of situations. Additionally, as we move from a general 
conceptualisation of a situation rich with semantic flexibility to a specific under- 
standing, we tend to eschew context. We do this because the very generality that 
gives us the ability to deal with many varied and new situations is a barrier to 
communication; at the same time that ambiguity allows adaptation, it prohibits 
individuals from establishing the certainty of agreement that is necessary for 
confidence that each understands the other. 

However, as organisations discover, standardisation of practices and under- 
standings does not create a panacea for the difficulties of communication and 
collaboration. On a small scale, adoption of standardised approaches helps indi- 
viduals to cooperate and achieve goals too large for a single person. On a larger 
scale, the effort required to establish and prescribe global standards and common 
approaches grows rapidly beyond feasibility as the number of participants and 
the amount of data being manipulated increase. As our ability to communicate 
and interact across cultural borders increases, so does our desire to do so. And 
as we come to terms with the necessities of increased interoperation and develop 
coping strategies, if our software tools are to scale similarly we must provide 
them with equivalent reconciliation capabilities. 



3 Our Software Colleagues 

In many respects, computers are an extreme example of co-workers with poor 
teamwork and communication skills. When specifying a task for a software ap- 
plication or agent we must specify every step in precise detail, detail that will 
generally remain constant throughout the life of the software. Whilst humans 
are able to adjust the level of abstraction at which they conceptualise a partic- 
ular situation, computers traditionally have the capacity only for comparatively 
very low levels of abstraction. As machines that follow explicit instructions to 
the letter, their operation is analogous to the most procedural organisational 
standards, and unsurprisingly they too have great difficulty adapting to new 
situations. 

Traditional computational paradigms require that computer-mediated rep- 
resentations of information and knowledge be exact and literal; for a computer 
to process information requires simplistic structuring of data and homogeneous 
representations of concepts. In order to maintain consistency during processing, 
traditional approaches require that each participant in a system, whether human 
or software, subscribe to a common understanding of the concepts within the 
system. In other words, traditional information systems require the adoption of 
an absolute ontological world-view; deviation from a priori agreed terms and 
understandings results in a breakdown in communication and loss of consistency 
through the system. 

This ontological homogeneity has worked well for systems with little direct 
human interaction, when the computers can be left to sort out technical details 
and humans can work at a level removed from the coal face. In fact, isolating 
technically detailed areas of a system from those areas with which humans interact 
permits engineering of the technical aspects to create an optimised environment. 
The World Wide Web is an example of a large-scale system in which the level 
at which humans interact with the system is greatly separated from the level at 
which machines interact. We write web pages and read them, navigating along 
hypertextual paths, while machines manage domain name resolution, protocol 
selection, transmission of data and rendering of text and images. The gap be- 
tween the activities of humans and machines is highlighted by the problems that 
occur when we try to make machines work closer to our level as we attempt to 
automate various functions that we currently perform manually. The example 
of this most recognisable to the ordinary web user is the task of searching for 
information, an obviously difficult problem that has yet to be solved to our satis- 
faction. But a more far-reaching problem is that of integrating the vast quantities 
of information available in such a way that we can seamlessly assimilate what- 
ever sources of data are most appropriate to the task at hand, whatever that 
task may be. 

4 Automating Conceptualisation 

Automation of data processing is desirable because it frees humans from the 
morass of detail and permits them to utilise their capacity for abstraction. The 
ability to manipulate concepts at varying levels of detail and to match the level 
of detail to the needs of the situation at hand is one of our most effective tools for 
processing knowledge and communicating. Being able to subsume detail within 
conceptual units of knowledge allows us to overcome the natural limits of our 
processing capacity; although there appear to be clear cognitive limits on the 
number of concepts we can articulate at any given time, we have the critical 
ability to ‘chunk’ collections of knowledge into single units [11,5], effectively pro- 
viding a capacity to search through information webs both widely and deeply as 
necessary. Similarly, when the scope of an information or data problem becomes 
too great for us to process in a reasonable amount of time, we bring computers 
to bear on the problem to assist us with storage, recall and simple processing. 
Automation of data processing provides increased speed and accuracy, and also 
permits the not insignificant relief of boredom resulting from repetitive tasks. 
By handing low-level information processing tasks to machines, humans are 
freed to consider issues at higher levels of abstraction. If we are to continue to 
advance the level of assistance that our computers can provide to us as we work, 
we must elevate our tools to higher levels of abstraction to accommodate the 
ever-increasing complexity of the situations we face. 

As knowledge travels through progressively lower levels of abstraction, its 
context degrades as generality is replaced by specificity and logical operability. 
Humans require some specification in order to communicate successfully, but 
the desired degree of consistency of conceptualisations determines the extent 
of specification that is necessary. Indeed, it is suggested that even consensus 
between participants is not always necessary for successful collaboration [1,12]. 
As discussed earlier, one of our greatest strengths as humans is our ability to 
adapt to new situations and reconcile new ontological concepts with our own 
history of previous experiences. We are also capable of identifying mismatches 
of understanding in our communications and negotiating shared perspectives as 
we interact with others [2]. We use the term ontological reconciliation for the 
process of resolving conceptual differences. Human natural language is neither 
precise nor predictable, and this seems to reflect the way that we understand 
the world through our internal representations and conceptualisations. When we 
express ourselves in natural language, we often encounter confusion and diffi- 
culty as others attempt to understand us. This requires us to explore alternative 
expressions, searching for representations that others understand. We do this 
naturally, and our attention is drawn to the process only when it fails. But we 
are generally capable of finding enough common ground for communication of 
knowledge to proceed; we are often even able to convey basic information with- 
out a common language, as any tourist who has managed to gain directions to 
a restaurant or train station with much waving of hands can attest. 

Fitting knowledge to logical representations is a subjective process. Decisions 
must be made about how to express complex concepts in relatively constrained 
languages; these decisions are made by people whose choices of representation 
and expression are influenced by their own cultural background. Consequently, as 
context is lost, problems arise as other organisations with different cultures, 
or even just individuals with different conceptualisations, attempt to understand 
the logical representation and rebuild the original knowledge. 

To return to the case of university web sites, it seems reasonable to assume 
that all universities partake in the teaching of students and in research. Most 
universities offer undergraduate degrees in the areas of engineering, arts, science 
and commerce. But when it comes to describing their activities, where one uni- 
versity may use the word course to refer to a particular degree program, another 
will use course to mean an individual subject within a degree; a third institu- 
tion may use course to describe a particular stream or program within a degree. 
Some institutions will say unit where others say subject and others say class. 
Simply due to their own individual organisational cultures, different institutions 
use different vocabularies to describe their activities. The researcher wishing 
to compare the services provided by different universities will generally quickly 
identify the differences and through an understanding of the knowledge domain 
concerning university activities and services will be able to translate between 
terms, usually assimilating them into the researcher’s own personal ontological 
understanding, which itself will be shaped by their personal experiences. If they 
are from a university that uses course to mean a unit of teaching and program to 
describe an undergraduate degree, they will probably translate the descriptions 
from other institutions into this ontology. If they are not from a particular 
university, they will probably draw on whatever experience they have of academic 
institutions, and if they have none, they may build their own ontology from the 
collection of university representations. 

To create software agents that can handle this level of ontological complexity 
would seem to be very difficult. Why then is it preferable to simply agreeing 
upon a global ontology to which all agents subscribe, a centralised language of 
understanding and representation, or even a global directory of multiple re-usable 
ontologies from which agents select as necessary? Ontology creation itself is very 
difficult. It requires the ability to define many concepts precisely and consistently. 
It requires the ability to predict appropriate assumptions and generalisations 
that will be acceptable to most, if not all, people. It also requires universal access 
and distribution infrastructure, and a well-established and accepted knowledge 
representation format. It requires some way to address the desire for agents 
and humans to interact at variable levels of abstraction as particular situations 
demand. It requires constant maintenance to ensure freshness and currency, yet 
also must provide backward compatibility for old agents. It requires that agent 
developers familiarise themselves with the prescribed knowledge representation 
formats, ontologies and protocols and adapt their own development efforts to suit 
them. These issues make a global ontology infrastructure unsuitable as the sole 
approach, and it is our belief that effort spent adding tolerance of heterogeneity 
to systems will provide greater benefit as we begin to introduce agents to our 
multi-cultural world. 

In addition to the practical benefits, one of our strongest desires for tolerance 
of heterogeneity for software systems is rooted unashamedly in idealism: humans 
manage to resolve ontological differences successfully, in real time and ‘on the 
fly’. This ability gives us much flexibility and adaptability and allows us to 
specialise and optimise where possible and yet generalise and compromise when 
necessary. Therefore, it seems both feasible and desirable to have as a goal a 
similar capability for software agents. 

If we are to make effective use of multi-cultural data from heterogeneous 
sources, we need ways and means to reconcile the differences in representation. 
If we are to work efficiently to solve large information problems, we need the 
assistance of automated mechanisms. To achieve both, we need systems that are 
tolerant of heterogeneity. 

Reconciling ontological differences requires understanding the difference be- 
tween concepts and their representations; in semiotic terms, appreciating the 
difference between the signified and the signifier. Reconciling ontological differ- 
ences means reading multiple texts that represent identical, similar or related 
concepts and being able to work with them at the concept level rather than at 
the level of representation. 

For XML documents or databases, it might be as simple as realising 
that two fields in different data sources actually contain the same class of data. 
On the other hand, it might be as complex as deciding that articles from an 
economics magazine and an automotive magazine are discussing different topics 
even though they both have ‘Ford’ and ‘analysis’ in their titles, something that 
current search technologies would be unlikely to realise. 

As the number of data sources available to us and our ability to access them 
on demand and in real time increase, the overhead of pre-constructing a 
complete ontology for a given interaction becomes less and less viable. Large-scale 
interconnectedness and increased frequency of data transactions across organisational 
and cultural borders lead to a reduction in the useful life of any 
context constructed for a particular transaction. Just as we are able to establish 
contexts and construct suitable local ontologies as needed for particular inter- 
actions, if we want to be able to include software agents in our higher level 
communication and knowledge management, they will need to be capable of 
similar conceptualisation. 



5 Results and Thoughts from the Intelligent Agent 
Laboratory 

The Intelligent Agent Laboratory at the University of Melbourne has been work- 
ing for a number of years on knowledge representation and manipulation for 
information agents [13,14]. When considering how best to structure knowledge 
for information agents, two questions arise: what types of knowledge should be 
pre-defined and what should be left to be learned dynamically? The work of 
the Intelligent Agent Laboratory addresses these questions in both theory and 
practice; the remainder of this paper describes two recent projects. 

5.1 CASA 

Classified Advertisement Search Agent (CASA) is an information agent that 
searches on-line advertisements to assist users in finding a range of information 
including rental properties and used cars [4]. It was built as a prototype to 
evaluate the principle of increasing the effectiveness and flexibility of informa- 
tion agents while reducing their development cost by separating their knowledge 
from their architecture, and discriminating between different classes of knowl- 
edge in order to maximise the reusability of constructed knowledge bases. CASA 
is able to learn how to interpret new HTML documents, by recognising and un- 
derstanding both the content of the documents and their structure. It also rep- 
resents a framework for building knowledge-based information agents that are 
able to assimilate new knowledge easily, without requiring re-implementation or 
redundant development of the core agent infrastructure. In a manner that draws 
on similar principles to object-oriented analysis and design methodologies and 
component-based development models, an agent shell developed from CASA [9] 
allows simple construction of agents that are able to quickly incorporate new 
knowledge bases, both learnt by the agent itself and incorporated from external 
sources. 

CASA classifies knowledge into three categories: general knowledge, domain 
specific knowledge and site or source specific knowledge. Each category is in- 
dependent from the others, and multiple instances of each category can exist. 
General knowledge gives a software agent enough information to understand and 
operate in its environment. General knowledge is knowledge that is true for all 
information sources, and is independent of specific domains and sites. The set 
of general knowledge developed for CASA describes on-line web documents, and 
includes knowledge of the components that make up an HTML document such 
as tables, paragraphs and lines, as well as knowledge of what a web 
page is and how one can be accessed. 

Domain specific knowledge provides an information agent with a basic understanding 
of the area in which it is required to work. This knowledge is true for 
a particular field and is independent of site or source specifics. For the case of 
university services, domain knowledge would generally include the concepts of 
students, lectures, theatres, semesters, professors and subjects, as well as on- 
tological relationships such as the idea that students take classes, classes cover 
particular topics and occur at certain times during the week at certain locations, 
and that particular subjects make up a course. Because domain knowledge is in- 
dependent of site specific knowledge, it can be re-used across numerous sites and 
should remain useful into the future. 

Site specific knowledge is true for a particular information source only. Site 
knowledge is specific and unique, but necessary for negotiating the contents of 
a particular information source; it provides a means of understanding the basic 
data that comprise an information source, for a particular representation. Con- 
tinuing the university web site example, site specific knowledge might encode the 
particular pattern or format in which a certain institution presents a description 
of a unit of teaching, or of a degree, including information such as table struc- 
tures, knowledge unit sequences and marker text that locates certain classes of 
information. 

The three categories of knowledge that CASA manages provide different 
levels of operational assistance for the information agent. General knowledge 
enables an agent to act and interact in a particular environment, providing the 
basis for navigation and perception and giving the agent a means by which to 
internalise its input. Site specific knowledge permits an agent to assimilate and 
process information from a particular source, which is a necessary ability if the 
agent is to perform useful tasks. Domain specific knowledge sits between general 
and site specific knowledge, giving a conceptual framework through which an 
agent can reconcile information from different sources. Domain specific knowledge 
can also assist an agent to negotiate unfamiliar information sources for 
which it has no site specific knowledge. Domain knowledge can be used in conjunction 
with general knowledge to analyse a site’s conventions and representations 
and to attempt to synthesise the site knowledge necessary to utilise the 
new information source. Because domain knowledge is not tied to a particular 
representation, it can be adapted and applied to a variety of different sites or 
data sources, significantly reducing development time for information agents. 
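
The separation of the three knowledge categories can be pictured with a small data-structure sketch. This is not CASA's implementation: the class names, fields, site names and marker text are all hypothetical, chosen only to show how site knowledge can map local vocabulary onto shared domain concepts.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class GeneralKnowledge:
    """True for all sources: here, the HTML components an agent can perceive."""
    html_components: Set[str]


@dataclass
class DomainKnowledge:
    """True for one field, independent of any particular site."""
    concepts: Set[str]
    relations: Set[Tuple[str, str, str]]  # (subject, relation, object)


@dataclass
class SiteKnowledge:
    """True for one source only: maps marker text to a domain concept."""
    marker_text: Dict[str, str]


@dataclass
class InformationAgent:
    general: GeneralKnowledge
    domain: DomainKnowledge
    sites: Dict[str, SiteKnowledge] = field(default_factory=dict)

    def interpret(self, site: str, marker: str) -> Optional[str]:
        """Translate a site's marker text into a domain concept, if known."""
        site_knowledge = self.sites.get(site)
        if site_knowledge is None:
            return None  # unfamiliar source: no site knowledge yet
        concept = site_knowledge.marker_text.get(marker)
        return concept if concept in self.domain.concepts else None


# Hypothetical knowledge bases for the university example.
agent = InformationAgent(
    general=GeneralKnowledge({"table", "paragraph", "line"}),
    domain=DomainKnowledge(
        concepts={"student", "subject", "degree"},
        relations={("student", "takes", "subject"), ("subject", "part_of", "degree")},
    ),
    sites={
        "uni-a": SiteKnowledge({"Unit": "subject", "Course": "degree"}),
        "uni-b": SiteKnowledge({"Course": "subject", "Program": "degree"}),
    },
)
print(agent.interpret("uni-a", "Course"), agent.interpret("uni-b", "Course"))
```

The same marker text resolves to different domain concepts depending on the site, while the domain and general knowledge are shared unchanged across both sources.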

5.2 AReXS 

Automatic Reconciliation of XML Structures (AReXS) is a software engine that 
attempts to reconcile differences between XML structures that encode equivalent 
concepts. It is able to identify differences of expression and representation across 
XML documents from heterogeneous sources without any predefined knowledge 
or human intervention [6]. It requires no knowledge or experience of the do- 
main in which it works, and indeed is completely domain independent. It uses 
Example-Based Frame Matching (EBFM) [7] and is able to achieve very high 
recall with modest precision on real world data collected from commercial web 
sites. 

By requiring no domain knowledge, AReXS is suitable for application to any 
field; its success relies on its ability to identify and resolve the differences in repre- 
sentation that result from sourcing data from a multi-cultural environment. For 
example, a pair of XML documents from different sources, both describing ser- 
vices offered by universities, might contain attributes named SUBJECT and UNIT 
respectively. If the two attributes happen to both signify self-contained units of 
course work, an agent with no prior domain experience or knowledge will have 
little hope of realising this. AReXS resolves this discontinuity by considering the 
values of instances of the attributes as well as the attribute names, deriving confi- 
dence in a match from similarities in either comparison. If one document contains 
the statement <SUBJECT>Introductory Programming</SUBJECT> and another
contains a similar statement <UNIT>Introduction to Programming</UNIT>,
AReXS is able to consider the possibility that the two attributes SUBJECT and 
UNIT are in this context signifying the same concept. If further correspondences 
could be found between other instances of these same attributes, the confidence 
of a conceptual match would increase. 
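
The matching idea described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the AReXS implementation: the standard-library `difflib.SequenceMatcher` stands in for the Character-Based Best Match algorithm, and the function names, the threshold of 0.6 and the sample data are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (a stand-in for CBBM)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_confidence(name_a, values_a, name_b, values_b):
    """Confidence that two attributes denote the same concept.

    Uses the better of the name similarity and the mean best-match
    similarity of the instance values, so that evidence from either
    comparison can support a match.
    """
    name_score = similarity(name_a, name_b)
    if values_a and values_b:
        best = [max(similarity(v, w) for w in values_b) for v in values_a]
        value_score = sum(best) / len(best)
    else:
        value_score = 0.0
    return max(name_score, value_score)


def equivalence_map(doc_a, doc_b, threshold=0.6):
    """Map each attribute of doc_a to its best-scoring attribute in doc_b."""
    mapping = {}
    for name_a, values_a in doc_a.items():
        scored = [(match_confidence(name_a, values_a, name_b, values_b), name_b)
                  for name_b, values_b in doc_b.items()]
        if not scored:
            continue
        score, name_b = max(scored)
        if score >= threshold:  # hypothetical cut-off
            mapping[name_a] = name_b
    return mapping


# Flat attribute -> instance values, as if extracted from two XML sources.
site_a = {"SUBJECT": ["Introductory Programming", "Machine Vision"]}
site_b = {"UNIT": ["Introduction to Programming", "Machine Vision"],
          "FEE": ["1200", "950"]}
print(equivalence_map(site_a, site_b))
```

Here the attribute names SUBJECT and UNIT share almost no characters, so the match is carried by the similarity of their instance values, which is the situation the paragraph above describes.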

AReXS works by analysing two XML structures and identifying matching 
attributes, generating a map of equivalence between concepts represented in the 
two documents. Identification of conceptual equivalence is based on a consid- 
eration of lexicographical similarity between both the names and the values of 
attribute XML tags in each document. Matches are then assessed to deduce 
structural similarities between documents from different sources. By repeating 
this search for semiotic correspondence across other pairs of attributes gener- 
ated from the contents of the XML documents under consideration, AReXS is 
able to build a local context for data and then use this context to reconcile the 
ontological differences between XML documents. 

To establish the extent of the context shared by pairs of documents, the 
AReXS engine uses the Character-Based Best Match algorithm [10] to evaluate 
textual similarity between the names and values of attributes. Such a string based 
comparison works well to filter out simple manifestations of local cultures; for 
example, one university web site may choose to include the identification number
of a subject in the name of the subject while another may not, opting instead to 
have a second attribute containing a numeric identification code for each unit. 
While AReXS will not be able to realise that the number in the name of a subject 
from one university corresponds to the numeric unit code from another, it will 
generally conclude from the similarity of the names that units and subjects are 
conceptually compatible in this context. 

Applying a textual similarity analysis on real data is likely to generate a 
large number of candidate concepts that may or may not contribute to the 
local context of the data. AReXS increases its confidence in a candidate for 
equivalence depending on the uniqueness of the matches between attribute pairs. 
The uniqueness function described by [7] is used to establish the likelihood of 
a textual match between attributes actually revealing a shared, unique concept, 
based on the principle that the more common a concept is across significantly 
different attributes, the less rich the concept is and thus the less there is to be 
gained from considering it as part of the data context. 

The results of tests based on sample real world data from web sites including 
amazon.com, barnesandnoble.com, angusandrobertson.com.au and borders.com 
show that AReXS is capable of accurately identifying conceptually equivalent 
attributes based on both the attribute names and sample instances of the at- 
tributes. These web sites were chosen as useful examples for two reasons. Firstly, 
they are live, international representatives of the types of data source with which 
people desire to interact (and in fact already do interact) on a regular but casual 
basis, and secondly they provide data that by its nature is open to subjective 
decisions during the process of choosing a logical representation. The casual na- 
ture of the interaction that people generally have with sites such as these is 
important, as discussed earlier in this paper. 

The AReXS algorithms allow identification of concept matches regardless of 
the ordering of concepts or attributes, and its consideration of both names and 
values of attributes allows it to identify equivalences even if one of the name or 
the value is absent; in other words, AReXS is tolerant of inconsistent data. The 
AReXS engine has also demonstrated partial success in identifying many-to-one 
conceptual equivalences, which can occur in situations like that described earlier 
in which multiple concepts are represented by multiple attributes in one data 
source but only one attribute in the other data source. 

6 Further Thoughts 

AReXS is in reality only a prototype that serves as a demonstration of the po- 
tential for automated reconciliation of the ontological differences that manifest 
in data sources from a culturally heterogeneous environment. Because the ef- 
fectiveness of the concept matching algorithm is improved by examining more 
instances of the data, and each data attribute must be examined to increase the 
confidence of the conceptual matches, AReXS currently suffers from poor scalability
as the complexity of data objects increases. The Character-Based Best Match (CBBM)
algorithm used for comparing attribute names and values is heavily biased toward text strings
and struggles with variations of numerical data. Due to the modular design of 
AReXS, this component of the engine could be significantly improved with a 
combination of simple heuristics, alternative matching algorithms and possibly 
even the capacity to pre-populate the data context with concepts previously ob- 
served or learned. AReXS currently can only work with flat or un-nested XML 
structures, although it is quite reasonable to imagine extending the principles it 
demonstrates to more complex data structures, or even incorporating the AReXS 
concept matching engine as a component in a more sophisticated data analysis 
system. 

Although AReXS only supports reconciling pairs of data sources, the EBFM 
algorithm on which it is based does allow for comparison of multiple sources 
and so extending AReXS to support this feature is feasible. While AReXS 
is partially able to recognise many-to-one equivalences, it will require further 
work to actually capitalise on this recognition. Finally, the principles imple- 
mented in AReXS could quite readily be adapted to allow the extension of data 
structures based on identification of concept matches within attribute names 
or values. Drawing on the example described earlier of university service de- 
scriptions, if one institution chose to present teaching units with an attribute 
of the form <UNIT>Machine Vision (Semester 1)</UNIT> and a second in- 
stitution opts for two attributes <SUBJECT>Machine Vision</SUBJECT> and 
<SEMESTER>1</SEMESTER>, it is possible to see that a software agent could use
analysis techniques similar to those implemented in AReXS to realise that both 
attributes from the second source are encoded within a single attribute of the 
first source. 
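
A very rough sketch of this idea follows. It assumes a naive, case-insensitive substring test, which is far weaker than the similarity analysis AReXS performs; the function name and the sample records are hypothetical.

```python
def contained_attributes(combined_value, candidate):
    """Names of candidate attributes whose value occurs inside combined_value.

    Naive substring test: short values such as a single digit will match
    far too easily on real data, so this only illustrates the principle.
    """
    text = combined_value.lower()
    return [name for name, value in candidate.items() if value.lower() in text]


first = {"UNIT": "Machine Vision (Semester 1)"}
second = {"SUBJECT": "Machine Vision", "SEMESTER": "1"}
print(contained_attributes(first["UNIT"], second))
```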

A significant benefit of classifying knowledge into categories is that knowledge 
can be more readily reused and incorporated into other agents. Compartmental- 
ising knowledge also allows agents to teach each other about new information 
sources or even new knowledge domains. Domain knowledge is reusable by de- 
sign, and general knowledge is similarly useful. Given the modular approach 
to information agent construction presented in CASA, once an agent has been 
taught about a certain domain of knowledge, that knowledge can be applied to a 
variety of environments just as easily as it can a variety of sites. By plugging in 
a different general knowledge base, a web-based information agent could easily 
become an SQL- or XML-based information agent, with the cost of redevel- 
opment greatly reduced by the re-applicability of the domain knowledge base. 
It also seems quite feasible for an information agent to be armed with a vari- 
ety of general knowledge bases permitting it to work in multiple environments 
as appropriate, or even at the same time, utilising its knowledge as applicable 
both to process recognised information and to interpret and negotiate unfamiliar 
conceptual representations. 
Abstract. This paper investigates foundations for the description of,
and reasoning about, trust in secure digital communication. We propose
a logic, called the Typed Modal Logic (TML), which extends first-order
logic with typed variables and modal operators to express agent beliefs.
Based on the logic, the theory of trust for a specific security system can
be established. Such trust theories provide a foundation for reasoning 
about trust in digital communication. 



1 Introduction 

Trust is essential to a communication channel. Modern secure digital commu- 
nication is usually based on cryptography. Investigating foundations for trust 
in secure digital communication, we need to answer the following questions: 
(1) What kind of trust is involved in a secure digital communication? (2) How 
should we specify trust and trust relations among agents involved in such com- 
munications? (3) Can trust be transferred as required? (4) How should we 
manage trust? (5) How should we reason about trust in such communications? 

This paper intends to address these questions by providing a formal theory 
and methodology for specifying and reasoning about trust in a system. In order 
to develop a formal theory of trust, we first start with a logic upon which the 
theory can be based. What kind of a logic is suitable for modelling trust in digital 
communication? One of the most desirable properties for a formal theory is an 
ability to capture what agents intend to say and what they are thinking about. 
This indicates that reasoning about trust involves reasoning about the notion of 
belief, and the theory of trust should therefore be based on a kind of belief logic. 

There are several logics, such as the BAN logic [1] and Rangan’s Logic [9], 
which can be used for reasoning about belief in secure communications. However,
there is a lack of “standard” logical foundations and techniques which can gen- 
erally be used for specifying and reasoning about trust in modern secure digital 
communication. In this paper, we propose a new belief logic, called the Typed 
Modal Logic (TML), on which a trust theory for any particular security system, 
(such as a public key infrastructure), can be established. These trust theories 
provide a foundation for analysing and reasoning about trust in particular envi- 
ronments and systems. 
The rest of this paper is organized as follows. Section 2 discusses the notion of
trust in general, and talks about why we need a logic of belief. Section 3 presents 
the logic TML, including its syntax, semantics, and proof system. Section 4 
discusses trust theories with an example. Section 5 discusses the process and 
techniques for reasoning about trust. The last section concludes with a short 
discussion about future work. 



2 Trust and Belief 

The notion of trust is fundamental for understanding the interactions between 
agents such as human beings, machines, organizations, and other entities [7]. 
Linguistically, “trust” is closely related to “true” and “faithful”, with a usual 
dictionary meaning of “assured reliance on the character, the integrity, justice, 
etc., of a person, or something in which one places confidence.” So, in common 
English usage “trust” is what one places his confidence in, or expects to be 
truthful [3,11]. 

Digital communication involves computer systems. A computer system can be 
regarded as an interconnection of people, hardware, and software, together with 
their external connections. We view a secure digital communication environment 
(e.g., the Internet) as a large complex system consisting of a number of agents, 
i.e., entities who are involved in the system. Agents need to trust others in 
certain aspects if they are to have confidence that such interactions will lead to 
a desirable outcome. When we say that agent A trusts another agent B, this 
means that (in some sense) the two agents are situated in a state in which, from 
A’s perspective, certain actions by B will be chosen under certain circumstances. 
In other words, A may believe that B will truthfully do certain actions which 
concern A. For instance, if a data service system (DS) trusts an authentication
server (AS) to verify the ID claim of any user who wants to access DS, then DS
may believe that the information provided by AS can be used. As an example,
DS may believe “Alice is a legal user” because it knows AS says this.

Discussing formal descriptions of trust in digital communication, we need to 
note the following features within the notion of trust: 

— No global trust exists in a secure digital communication environment. In 
other words, there is no agent who can be trusted by all others. This is 
obvious in distributed systems. In fact, even in a hierarchical system, such 
as a hierarchical PKI, although it is more likely that we may assume that 
all agents would trust the top Certification Authority, in practice there may 
still be some agents who do not trust it and may try to check its behavior 
in a variety of ways. 

— Trust is not static. The trust of an agent can be changed dynamically. For 
example, for two weeks agent A trusts agent B, but this morning A found 
that B lied to him, so A no longer trusts B. 

— There is no full trust. A’s full trust in B means that A believes everything 
B says. However, in most cases, this is impossible - an agent cannot trust 
all statements provided by another agent. We choose a limited trust model,
where “agent A trusts agent B” means that A will only trust B on some 
topics. 

— Trust relations lack the properties of transitivity and symmetry. That is, we
cannot derive the conclusion “$A_1$ trusts $A_3$” from “$A_1$ trusts $A_2$” and “$A_2$
trusts $A_3$”, and cannot assert that we should have “B trusts A” from “A
trusts B”.
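
The features listed above can be illustrated with a small data structure. This is a minimal sketch under our own assumptions (the class name `TrustStore`, its methods and the topic strings are hypothetical): trust is stored as explicit, directed, topic-limited statements that can be revoked, and nothing is inferred by transitivity or symmetry.

```python
from dataclasses import dataclass, field


@dataclass
class TrustStore:
    """Directed, topic-limited trust statements: (truster, trustee, topic)."""
    statements: set = field(default_factory=set)

    def grant(self, truster, trustee, topic):
        self.statements.add((truster, trustee, topic))

    def revoke(self, truster, trustee, topic):
        # Trust is not static: a statement can be withdrawn at any time.
        self.statements.discard((truster, trustee, topic))

    def trusts(self, truster, trustee, topic):
        # Only explicitly recorded statements count; no transitive or
        # symmetric closure is computed.
        return (truster, trustee, topic) in self.statements


store = TrustStore()
store.grant("A1", "A2", "id-verification")
store.grant("A2", "A3", "id-verification")

print(store.trusts("A1", "A2", "id-verification"))  # recorded statement
print(store.trusts("A1", "A3", "id-verification"))  # not derived by transitivity
print(store.trusts("A2", "A1", "id-verification"))  # not derived by symmetry
```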

Let us consider a Public Key Infrastructure (PKI), which manages public keys.
If agent Alice wants to communicate securely with agent Bob, then Alice first
has to obtain Bob’s public key. The PKI provides a mechanism
for users to retrieve required certificates, so Alice can retrieve any certificates 
required. Once Alice has Bob’s certificate, in which Bob’s public key is bound, 
if Alice believes that the certificate is valid, then she may use the public key 
contained in Bob’s certificate to send secure messages to Bob. 

Consider an agent’s assertion about the truth of the proposition Valid(C),
where C is a certificate and the semantic interpretation of Valid(C) gives true
if and only if C is indeed valid. Such an assertion made by an agent is obviously
related to the agent’s belief. In fact, Alice would use Bob’s certificate only in
the case she cannot prove Valid(Bob’s certificate) to be false from her beliefs.
More strongly, Alice is not prepared to use Bob’s certificate unless she can prove
Valid(Bob’s certificate) from her own beliefs. To infer Valid(Bob’s certificate)
from her belief, Alice has to use some assumptions. In our approach, such as- 
sumptions will be encapsulated in a notion of trust for the system. 

From the above analysis, we may see that reasoning about trust actually 
involves reasoning about beliefs. Therefore, a theory of trust may be based on 
a logic that possesses the ability to represent beliefs. What kind of a logic can 
play the role? As Rangan [9] has pointed out, belief represents a disposition 
of an agent to a proposition, so a logic of expressing propositional dispositions 
should be able to express the required relations between believers and atti-
tudes. Classical first-order logic cannot handle such relations well. The modal
logic approach is able to enhance propositional and first-order logics with modal 
operators to represent agent beliefs. Considering this fact, we will attempt to 
develop an appropriate modal logic as a basis for establishing trust theories of 
secure digital communications. 

3 The Logic TML 

The logic we present in this paper is the Typed Modal Logic (TML), which 
is an extension of first-order logic with typed variables and modal operators 
expressing beliefs of rational agents. In this section, we discuss the syntax and 
semantics of TML, and present its proof system. 

3.1 Types 

In TML, all variables as well as functions are typed. A type can simply be viewed 
as a certain set of elements. Examples of simple types are numbers
and Boolean values. We first introduce several primitive types, which are used
throughout this paper, as follows: $\Omega$ (a set of agents), $\mathcal{K}$ (a set of keys), $\mathcal{S}$ (the set
of strings), and $\mathcal{N}$ (the set of natural numbers). In particular, we assume that the agent
set $\Omega = \{A_1, \ldots, A_k\}$. Other primitive types can be introduced at any time as
the need arises. We can also construct new types (so-called constructive types)
from existing types by using the recursive rule: if $T_1$ and $T_2$ are types, so are
$T_1 \times T_2$ and $T_1 \to T_2$ (cartesian products and function types).

The type of each variable is assigned in advance; the type of each constant 
is the type it belongs to. The type of a function is determined based on its 
definition, i.e., the types of all variables involved in it and the type corresponding 
to its range. For example, given a function $f(X_1, \ldots, X_n)$, if the types of variables
$X_1, \ldots, X_n$ are $T_1, \ldots, T_n$, respectively, and the type corresponding to the
range of the function is $T$, then the type of $f(X_1, \ldots, X_n)$ is $T_1 \times \ldots \times T_n \to T$.

The type of any predicate is defined in the same way, but the type corre-
sponding to the range is $B$, the Boolean type. The Boolean type $B$ consists of two
elements, true and false. Thus, for any n-ary predicate $p(X_1, \ldots, X_n)$, if
$T_1, \ldots, T_n$ are the types of $X_1, \ldots, X_n$, respectively, then the type of the predicate is
$T_1 \times \ldots \times T_n \to B$.

3.2 Syntax 

In our logic, we distinguish two different concepts, messages (in the first-order 
logic, called terms) and formulae. Messages can be names of agents, certificates, 
public keys, private keys, dates, strings having particular meanings, or other 
things. They can also be a combination (or sequence) of other messages. Messages 
are not formulae, although formulae are built from messages. Only formulae can
be true or false or have agents’ beliefs attributed to them. Formally, messages
can inductively be defined as follows:

- If $X$ is a variable or a constant of type $T$, then $X$ is a message of type $T$.

- If $X_1, \ldots, X_n$ are messages of type $T_1, \ldots, T_n$ respectively, and the type of
an n-ary function $f$ is $T_1 \times \ldots \times T_n \to T$, then $f(X_1, \ldots, X_n)$ is a message
of type $T$.

- $X$ is a message (of a certain type) iff it is generated by the above formation
rules.

In the vocabulary of our logic, apart from typed variables, function and
predicate symbols, we have the primitive propositional connectives $\neg$ and $\to$, the
universal quantifier “$\forall$”, and the modal operators $\mathbb{B}_{A_i}$ for all $A_i \in \Omega$. The formulae
of the logic are therefore inductively defined as follows:

- $p(X_1, \ldots, X_n)$ is a formula if $p$ is an n-ary predicate symbol and $X_1, \ldots, X_n$
are terms (messages) with types corresponding to $p$. In particular, we
have: (1) $X \in S$ is a formula if $X$ is a message of type $T$ and $S$ is a set that
consists of some elements of $T$; and (2) $X = Y$ is a formula if $X$ and $Y$ are
messages.

- $\neg\varphi$ and $\varphi \to \psi$ are formulae if $\varphi$ and $\psi$ are formulae.

- $\forall X \varphi(X)$ is a formula if $X$ is a free variable in the formula $\varphi(X)$.

- $\mathbb{B}_{A_i}\varphi$ is a formula if $\varphi$ is a formula, $i = 1, \ldots, k$.

Here, most of the expressions are standard notation, so we only need to give a
brief description for $\mathbb{B}_{A_i}$: it is read as “agent $A_i$ believes”, so $\mathbb{B}_{A_i}\varphi$ means
that the agent $A_i$ believes $\varphi$. In the language, the other connectives $\wedge$, $\vee$ and
$\leftrightarrow$, and the quantifier $\exists$, can be defined in the usual manner.



3.3 Semantics 

An agent’s beliefs arise primarily from the agent’s assumptions about the global 
state of the system. Thus an agent’s state of belief corresponds to the extent to 
which, based on its local state, the agent can determine what global state the 
system is in. In this view, we can associate with each agent a set of possible
global states, each of which, according to the agent’s beliefs, could be the real
global state. Based on a local state, an agent cannot determine the real global
state it is in; the agent can only conclude that some global states are possible.
Therefore, an agent believes $\varphi$ if and only if $\varphi$ is true in all the global states that
the agent considers possible. The agent does not believe $\varphi$ if and only if there is
at least one global state it considers possible where $\varphi$ does not hold.

From this analysis, a formal definition of the semantics for the logic TML
can be given as a possible-worlds semantics [6], using the notion of
possible global states for the semantic interpretation of belief.

Let $S$ be a set of states, $T_1, T_2, \ldots$ be the types over which variables range,
and $[Q \to T]$ denote the set of functions from the type $Q$ to the type $T$. We
now define interpretations on the state set $S$ as follows: An interpretation $\pi$ on
$S$ comprises

- assigning an element for each variable from its corresponding type;

- assigning an element of $[T_1 \times \ldots \times T_n \to T]$ for each n-ary function with
type $T_1 \times \ldots \times T_n \to T$, and

- for each state $s \in S$, assigning an element of type $[T_1 \times \ldots \times T_n \to B]$ for
each n-ary predicate symbol, if the n variables involved in the predicate
have the types $T_1, \ldots, T_n$ respectively.

The reader should note that giving an interpretation of our logic needs to care- 
fully deal with types of variables, functions and predicates. In particular, we 
should note that assignments to predicates must be made for all states in S. 
Therefore, assignments can be different at different states. 

A Kripke structure with $k$ agents $A_1, \ldots, A_k$ is a tuple $(S, \pi, R_{A_1}, \ldots, R_{A_k})$,
where $S$ is the set of all global states, $\pi$ is an interpretation on $S$, and $R_{A_i}$,
$i = 1, \ldots, k$, are relations on the global states in $S$. $R_{A_i}$, called the possibility
relation according to agent $A_i$, is defined as follows: $(s, t) \in R_{A_i}$ if and only if,
in the global state $s$, $A_i$ considers the global state $t$ as possible (note that the
global state $s$ includes the local state information of $A_i$).

Let $(S, \pi, R_{A_1}, \ldots, R_{A_k})$ be a Kripke structure. If $f(e_1, \ldots, e_n)$ is a term, then
$\pi(f(e_1, \ldots, e_n)) = \pi(f)(\pi(e_1), \ldots, \pi(e_n))$, where $f$ is an n-ary function symbol.
In the following, $\models_s \varphi$ stands for “$\varphi$ is true at $s$” or “$\varphi$ holds at $s$”, and iff for
“if and only if”. The semantics of formulae in TML can inductively be given as
follows: For any $s \in S$, we have

(1) For any n-ary predicate symbol $p$ and terms $e_1, \ldots, e_n$, $\models_s p(e_1, \ldots, e_n)$ iff,
at $s$, $p(e_1, \ldots, e_n)$ has the truth value “true” under the interpretation $\pi$. We
write it as $\pi(s)(p(e_1, \ldots, e_n)) = true$.

(2) $\models_s \neg\varphi$ iff it is not the case that $\models_s \varphi$.

(3) $\models_s \varphi \to \psi$ iff $\models_s \neg\varphi$ or both $\models_s \varphi$ and $\models_s \psi$.

(4) $\models_s \forall X \varphi(X)$, where $X$ is a free variable appearing in $\varphi$, iff, for all $d \in T$ (we
assume that $X$ has the type $T$), $\models_s \varphi(d)$, where $\varphi(d)$ is obtained by replacing
all $X$ by $d$ in $\varphi(X)$.

(5) $\models_s \mathbb{B}_{A_i}\varphi$ iff, for all $t$ such that $(s, t) \in R_{A_i}$, $\models_t \varphi$.

Furthermore, for the given structure $(S, \pi, R_{A_1}, \ldots, R_{A_k})$, we say that $\varphi$ is valid
under the structure, and write $\models \varphi$, if $\models_s \varphi$ for every state $s \in S$; we say that $\varphi$
is satisfiable under the structure if $\models_s \varphi$ for some state $s \in S$.
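
The truth conditions above can be made concrete with a small evaluator. This is a minimal sketch under simplifying assumptions of our own: formulae are nested tuples, only ground atomic propositions are supported (so clause (4) on quantifiers and the typed interpretation of terms are left out), and the names `holds`, `valid`, `satisfiable` and the example structure are hypothetical.

```python
def holds(structure, s, formula):
    """Return True iff `formula` holds at state `s` (clauses (1)-(3) and (5))."""
    states, facts, relations = structure
    kind = formula[0]
    if kind == "atom":        # clause (1), restricted to ground atoms
        return formula[1] in facts[s]
    if kind == "not":         # clause (2)
        return not holds(structure, s, formula[1])
    if kind == "implies":     # clause (3): not-phi, or both phi and psi
        return (not holds(structure, s, formula[1])) or holds(structure, s, formula[2])
    if kind == "believes":    # clause (5): true in every state considered possible
        agent, body = formula[1], formula[2]
        return all(holds(structure, t, body)
                   for (u, t) in relations[agent] if u == s)
    raise ValueError(f"unknown formula kind: {kind}")


def valid(structure, formula):
    """True iff the formula holds at every state of the structure."""
    return all(holds(structure, s, formula) for s in structure[0])


def satisfiable(structure, formula):
    """True iff the formula holds at some state of the structure."""
    return any(holds(structure, s, formula) for s in structure[0])


# Two global states; in s0 the certificate is not valid, but the only state
# alice considers possible from s0 is s1, where it is valid.
STATES = {"s0", "s1"}
FACTS = {"s0": set(), "s1": {"valid_c"}}
RELATIONS = {"alice": {("s0", "s1"), ("s1", "s1")}}
M = (STATES, FACTS, RELATIONS)

valid_c = ("atom", "valid_c")
believes_valid = ("believes", "alice", valid_c)

print(holds(M, "s0", believes_valid))  # alice believes the certificate is valid
print(holds(M, "s0", valid_c))         # although it is not valid at s0
```

In this example structure the actual state s0 is not among the states alice considers possible, so a belief held at s0 need not be true at s0.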

The proof system for a logic of belief depends on the properties of the pos-
sibility relations. We say that a binary relation $R$ on a set $S$ is reflexive if
$(s, s) \in R$ for all $s \in S$; $R$ is symmetric if, for all $s, u \in S$, if $(s, u) \in R$, then
$(u, s) \in R$; $R$ is transitive if, for all $s, u, v \in S$, if $(s, u) \in R$ and $(u, v) \in R$, then
$(s, v) \in R$; and $R$ is Euclidean if, for all $s, u, v \in S$, if $(s, u) \in R$ and $(s, v) \in R$,
then $(u, v) \in R$. Based on the semantics definition given above, we can show
that the possibility relation for our notion of belief is symmetric, transitive, and
Euclidean. We leave the proofs for the reader. Here we only point out that an actual
state may not be one of the possible states; therefore, the possibility relation
is not reflexive. In fact, from the semantics definition, we can see that the fact 
“an agent believes that a formula is true” does not mean the formula is really 
true. The reason is that the agent’s cognition is “local” and only based on those 
“possible states” it has considered. 
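
The four properties defined above are easy to check mechanically for a finite relation. The sketch below assumes a relation is given as a Python set of pairs; the function names and the example relation are our own and only illustrate the definitions.

```python
def is_reflexive(R, S):
    return all((s, s) in R for s in S)


def is_symmetric(R):
    return all((u, s) in R for (s, u) in R)


def is_transitive(R):
    return all((s, v) in R
               for (s, u) in R for (u2, v) in R if u == u2)


def is_euclidean(R):
    return all((u, v) in R
               for (s, u) in R for (s2, v) in R if s == s2)


# s0 plays the role of an actual state that is not itself considered possible.
S = {"s0", "s1", "s2"}
R = {("s1", "s1"), ("s1", "s2"), ("s2", "s1"), ("s2", "s2")}

print(is_reflexive(R, S), is_symmetric(R), is_transitive(R), is_euclidean(R))
```

The example relation satisfies the symmetric, transitive and Euclidean conditions but is not reflexive on S, because (s0, s0) is missing.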

3.4 Axioms and Rules of Inference 

The proof system of TML consists of a set of axioms and a set of rules of 
inference. The axiom set consists of the following axiom schemata: 

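A1. $\varphi \to (\psi \to \varphi)$

A2. $(\varphi \to (\psi \to \chi)) \to ((\varphi \to \psi) \to (\varphi \to \chi))$

A3. $(\neg\varphi \to \neg\psi) \to (\psi \to \varphi)$

A4. $\forall X \varphi \to \varphi$, where $\varphi$ does not contain any free occurrence of $X$.

A5. $\forall X \varphi(X) \to \varphi(Y)$, where $Y$ is free for $X$ in $\varphi(X)$.

A6. $\forall X (\varphi \to \psi) \to (\varphi \to \forall X \psi)$, where
$\varphi$ does not contain any free occurrence of $X$.

A7. $\mathbb{B}_{A_i}(\varphi \to \psi) \wedge \mathbb{B}_{A_i}\varphi \to \mathbb{B}_{A_i}\psi$, $i = 1, \ldots, k$.

A8. $\mathbb{B}_{A_i}(\neg\varphi) \to \neg(\mathbb{B}_{A_i}\varphi)$, $i = 1, \ldots, k$.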
The first six axioms are axiom schemata from classical first-order logic;
Axiom A7 involves transition of agent belief and is the so-called belief transitive
axiom; Axiom A8 deals with negation of belief, and it says that an agent believes
$\neg\varphi$ iff it is not the case that the agent believes $\varphi$, i.e., the agent does not believe
$\varphi$.

Each axiom schema generates infinitely many instances, so there are infinitely many axioms
contained within the proof system.

The rules of inference in this logic include: 

R1. From $\vdash \varphi$ and $\vdash \varphi \to \psi$ infer $\vdash \psi$ (Modus Ponens)

R2. From $\vdash \forall X \varphi(X)$ infer $\vdash \varphi(Y)$ (Instantiation)

R3. From $\vdash \varphi(X)$ infer $\vdash \forall X \varphi(X)$ (Generalisation)

R4. From $\vdash \varphi$ infer $\vdash \mathbb{B}_{A_i}\varphi$, $i = 1, \ldots, k$ (Necessitation)

where $X$ is a free variable and $\vdash$ is a metalinguistic symbol. ‘$\Gamma \vdash \varphi$’ means that $\varphi$
is derivable from the set of formulae $\Gamma$ (and the axioms); ‘$\vdash \varphi$’ means that $\varphi$ is
a theorem, i.e., derivable from axioms alone.
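
As a small illustration of how the axioms and rules combine, the following derivation shows that belief is closed under provable implication. It is an illustrative sketch rather than a result stated in the text, and it assumes the readings of the belief axiom A7 and the Necessitation rule R4 given above, writing $\mathbb{B}_{A_i}$ for the belief operator of agent $A_i$.

```latex
\begin{align*}
1.\quad & \vdash \varphi \to \psi
        && \text{(given)} \\
2.\quad & \vdash \mathbb{B}_{A_i}(\varphi \to \psi)
        && \text{(R4 applied to 1)} \\
3.\quad & \vdash \mathbb{B}_{A_i}(\varphi \to \psi) \wedge \mathbb{B}_{A_i}\varphi \to \mathbb{B}_{A_i}\psi
        && \text{(instance of A7)} \\
4.\quad & \vdash \mathbb{B}_{A_i}\varphi \to \mathbb{B}_{A_i}\psi
        && \text{(propositional reasoning from 2 and 3, using A1--A3 and R1)}
\end{align*}
```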

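For a logic, soundness and completeness are important issues. Our logic is
sound and complete. The correctness (soundness) of all axioms and rules of infer-
ence can be expressed through the following results: For any formulae $\varphi$, $\psi$ and
any agent $A_i$ ($i = 1, \ldots, k$),

1. if $\varphi$ is an instance of one of the axioms A1-A6, then $\models \varphi$.

2. if $\models \varphi$ and $\models \varphi \to \psi$, then $\models \psi$.

3. if $\models \forall X \varphi(X)$, where the type of variable $X$ is $T$, then $\models \varphi(d)$, where $d$ is an arbitrary
constant in the type $T$.

4. if $\models \varphi(X)$ for any $X \in T$, then $\models \forall X \varphi(X)$, where $T$ is the type of $X$.

5. $\models (\mathbb{B}_{A_i}\varphi \wedge \mathbb{B}_{A_i}(\varphi \to \psi)) \to \mathbb{B}_{A_i}\psi$.

6. if $\models \varphi$ then $\models \mathbb{B}_{A_i}\varphi$.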

The proofs of these assertions are not difficult, and we omit them. Also, we 
do not attempt to discuss completeness of our logic, which would be covered in 
an extended version of this paper. 



4 Trust Theory 

As we stated before, in a given system supporting secure digital communication, 
if an agent wants to derive a conclusion from its belief, the agent has to use 
some assumptions regarding what the system should satisfy (and/or what it should
truthfully do). All agents have to trust these assumptions unless they do not need
the system. When an agent uses such assumptions, the agent actually places its
trust in the system: it believes that the behavior of the system can be trusted.
For example, in a public key infrastructure, agents may use an assumption such as
“if a certificate is valid, so is the public key contained in the certificate.”
This actually comes from an implicit assumption that all agents trust all
CAs (Certification Authorities) to faithfully execute their operations and that
it is not viable to tamper with certificates as the cryptography protecting the
certificate is sound. 

In our approach, such assumptions are encapsulated in the notion of trust 
and represented by a set of trust axioms of the system. Thus, a trust theory 
for the system consists of the logic TML, together with the set of axioms of the 
particular system. As an example, in the following, we present a theory for a 
PKI system, called the PKI Theory. 

A PKI refers to an infrastructure for distributing public keys where the au-
thenticity of public keys is certified by Certification Authorities (CAs). Without
loss of generality, we define a PKI certificate to have the following form: 

Cert (I, DS, DE, S, PK, E, Sig) 

where I is the issuer, DS and DE are the start date and expiry date respectively, 
S is the subject of the certificate, PK is the value of the public key for S, E is 
the value of the extension field, and Sig holds the signature of the issuer I. We 
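introduce a constructive type $\mathcal{C}$ as $\Omega \times \mathcal{N} \times \mathcal{N} \times \Omega \times \mathcal{K} \times \mathcal{S} \times \mathcal{S}$, which is intended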
to represent the set of certificates. 

The types of variables are assigned as follows: let $A, B, A_1, A_2, \ldots$ be agent
variables ranging over the type $\Omega$; $C, C_1, C_2, \ldots$ certificate variables ranging
over the type $\mathcal{C}$; $PK, PK_1, PK_2, \ldots$ public key variables ranging over the
type $\mathcal{K}$; $SK, SK_1, SK_2, \ldots$ private key variables ranging over the type $\mathcal{K}$; and
$T, T_1, T_2, \ldots$ time variables ranging over the type $\mathcal{N}$. The constants we use in this
paper include agent constants, such as alice, bob; certificate constants $c, c_1, c_2, \ldots$;
and time constants $t, t_1, t_2, \ldots$, etc. A special time constant today is employed
to represent the current time.

With certificates we have eight projection functions defined as follows: for
any certificate $C = Cert(I, DS, DE, S, PK, E, Sig)$,

$I(C) = I$    $DS(C) = DS$    $DE(C) = DE$

$S(C) = S$    $PK(C) = PK$    $E(C) = E$

$Sig(C) = Sig$    $tbs(C) = (I, DS, DE, S, PK, E)$

The meanings of these functions are obvious. We only point out that tbs represents
“to be signed”.

We may write $K = (PK, SK)$ to mean that the public key of key pair $K$
is $PK$ and the private key corresponding to $PK$ is $SK$ (sometimes we write $SK$
as $SK(PK)$ to indicate the correspondence). Note that no one can calculate
the private key from the public key although the correspondence has been rep-
resented. $\{M\}_X$ and $\langle M \rangle_X$ represent $M$ encrypted under the key $X$ and $M$
decrypted under the key $X$ respectively, where $X$ is a public key or a private
key. $CRL_{A_i}$ denotes the certificate revocation list of an agent $A_i$, i.e., at the current
state, all certificates listed in $CRL_{A_i}$ have been revoked by agent $A_i$.

To verify a required certificate, agents should agree with the following as- 
sumptions concerning trust within the PKI, which form the set of trust axioms 
of the system: 
T1. $\forall C (Valid(C) \to Valid(PK(C)))$

T2. $\forall K (K = (PK(C), SK(PK(C))) \wedge Valid(PK(C)) \to Valid(K))$

T3. $\forall K \forall M (K = (PK, SK) \wedge Valid(K) \to (\langle \{M\}_{SK} \rangle_{PK} = M))$

T4. $\forall K \forall M (K = (PK, SK) \wedge Valid(K) \to (\langle \{M\}_{PK} \rangle_{SK} = M))$

T5. $\forall C (\exists C' (Valid(C') \wedge (I(C) = S(C')) \wedge (tbs(C) = \langle Sig(C) \rangle_{PK(C')}))$
$\to Valid(Sig(C)))$

T6. $\forall C (Valid(Sig(C)) \wedge today > DS(C) \wedge today < DE(C) \wedge \neg(C \in CRL_{I(C)})$
$\to Valid(C))$

The meanings of these axioms are as follows. Axiom T1 says that, if a certificate
is valid, then the public key contained in the certificate is valid. Axiom T2 says
that, if the public key bound to the subject of a certificate is valid, then the key
pair consisting of the public key and the private key corresponding to it is valid.
Axiom T3 says that, for any message $M$, we have $(\{M\}_{SK})_{PK} = M$ if the key
pair $(PK, SK)$ is valid. The meaning of T4 is symmetric to T3. Axioms T5 and
T6 allow agents to verify the signature of a certificate as well as the certificate
itself based on another certificate whose validity has been established.

Digital signature algorithms usually involve the use of a hash function; however,
in order to simplify our discussion, we do not consider this. So, in axiom T5, to
verify the signature of the certificate $C$, one is required only to check whether
$tbs(C) = (Sig(C))_{PK(C')}$ holds.

Let TA = {T1, ..., T6}; then TML and TA together constitute the PKI theory,
a trust theory for the PKI. In the next section, we discuss reasoning about trust
by working through a practical example based on this theory.
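The premises of axioms T5 and T6 can be sketched as executable checks. The following is a minimal illustration only, assuming a purely symbolic model of encryption (no real cryptography) and integer time stamps; the names `Key`, `Encrypted`, `Cert`, `signature_valid` and `cert_valid` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Any, Collection, Tuple


@dataclass(frozen=True)
class Key:
    pair_id: str   # identifies the key pair K = (PK, SK)
    private: bool  # True for SK, False for PK


@dataclass(frozen=True)
class Encrypted:
    body: Any
    key: Key


def encrypt(message: Any, key: Key) -> Encrypted:
    """Symbolic {M}_X: the message tagged with the key used."""
    return Encrypted(message, key)


def decrypt(cipher: Any, key: Key) -> Any:
    """Symbolic (M)_X: succeeds only with the other half of the same pair (cf. T3, T4)."""
    if (isinstance(cipher, Encrypted)
            and cipher.key.pair_id == key.pair_id
            and cipher.key.private != key.private):
        return cipher.body
    return None


@dataclass(frozen=True)
class Cert:
    issuer: str   # I(C)
    ds: int       # DS(C), start of the validity period
    de: int       # DE(C), end of the validity period
    subject: str  # S(C)
    pk: Key       # PK(C)
    ext: Any      # E(C)
    sig: Any      # Sig(C)

    def tbs(self) -> Tuple:
        """tbs(C): the fields 'to be signed'."""
        return (self.issuer, self.ds, self.de, self.subject, self.pk, self.ext)


def signature_valid(c: Cert, trusted: Cert) -> bool:
    """Premises of T5, with `trusted` playing the role of a valid C'."""
    return c.issuer == trusted.subject and c.tbs() == decrypt(c.sig, trusted.pk)


def cert_valid(c: Cert, trusted: Cert, today: int, crl: Collection[Cert]) -> bool:
    """Premises of T6 on top of T5; `crl` stands for CRL_{I(C)}."""
    return signature_valid(c, trusted) and c.ds < today < c.de and c not in crl
```

Under these assumptions, a certificate passes `cert_valid` only when it was signed with the private key matching the trusted certificate's public key, the current time lies strictly inside its validity period, and it is not listed as revoked by its issuer.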



5 Reasoning about Trust: An Example 

Suppose that maris holds a certificate, $c_1$, and chuchang holds a certificate, $c_2$,
which is signed by maris with his private key corresponding to the public key
bound to $c_1$. Consider the following case: john requires chuchang's certificate; he
trusts maris and, in particular, trusts maris' certificate, i.e., he believes that $c_1$ is
valid, but at the moment he does not trust chuchang's certificate $c_2$. Therefore,
in order to use $c_2$, john must verify it.

Based on the PKI theory, the verification process can be outlined as follows: 

(1) $B_{john}\,Valid(c_1)$. (assumption)

(2) $I(c_2) = S(c_1)$ (= maris). (assumption)

(3) $B_{john}(I(c_2) = S(c_1))$. (by rule R4)

(4) $tbs(c_2) = (Sig(c_2))_{PK(c_1)}$. (checked and assumed to be true)

(5) $B_{john}(tbs(c_2) = (Sig(c_2))_{PK(c_1)})$. (by rule R4)

(6) $B_{john}(Valid(c_1) \wedge (I(c_2) = S(c_1)) \wedge (tbs(c_2) = (Sig(c_2))_{PK(c_1)}))$.
(from (1), (3) & (5))

(7) $Valid(c_1) \wedge (I(c_2) = S(c_1)) \wedge (tbs(c_2) = (Sig(c_2))_{PK(c_1)}) \rightarrow Valid(Sig(c_2))$.
(by axiom T5 & rule R2)

(8) $B_{john}(Valid(c_1) \wedge (I(c_2) = S(c_1)) \wedge (tbs(c_2) = (Sig(c_2))_{PK(c_1)}) \rightarrow Valid(Sig(c_2)))$.
(by rule R4)

(9) $B_{john}\,Valid(Sig(c_2))$. (from (6) & (8), and by axiom A7 and rule R1)



Furthermore, if the formulae $today > DS(c_2)$, $today < DE(c_2)$ and
$\neg(c_2 \in CRL_{maris})$ are all checked and hold, then we have

(10) $today > DS(c_2) \wedge today < DE(c_2) \wedge \neg(c_2 \in CRL_{maris})$.

Thus, we can have

(11) $B_{john}(today > DS(c_2) \wedge today < DE(c_2) \wedge \neg(c_2 \in CRL_{maris}))$. (by rule R4)

(12) $B_{john}(Valid(Sig(c_2)) \wedge today > DS(c_2) \wedge today < DE(c_2) \wedge \neg(c_2 \in CRL_{maris}))$.
(from (9) & (11))

(13) $Valid(Sig(c_2)) \wedge today > DS(c_2) \wedge today < DE(c_2) \wedge \neg(c_2 \in CRL_{maris}) \rightarrow Valid(c_2)$.
(by axiom T6 and rule R2)

(14) $B_{john}(Valid(Sig(c_2)) \wedge today > DS(c_2) \wedge today < DE(c_2) \wedge \neg(c_2 \in CRL_{maris}) \rightarrow Valid(c_2))$.
(by rule R4)

(15) $B_{john}\,Valid(c_2)$. (from (12) & (14), by axiom A7 and rule R1)

Having completed the proof, we therefore have

(*) $B_{john}\,Valid(c_1) \vdash B_{john}\,Valid(c_2)$.

The expression (*) can formally be read as follows: the fact “john believes that
$c_2$, chuchang's certificate, is valid” is derived from the fact “john believes that
$c_1$, maris's certificate, is valid”. Intuitively, such an expression represents a trust
transfer: an agent’s trust in a certificate may be transferred from its trust in
another certificate. In general, an expression ‘$B_{A_i}\varphi \vdash B_{A_i}\psi$’ represents that an
agent’s trust in $\psi$ is transferred from its trust in $\varphi$ (or its belief in $\psi$ is transferred
from its belief in $\varphi$).

Trust axioms T2-T4 are not directly used in the proof process. However,
we have to point out that checking whether $tbs(c_2) = (Sig(c_2))_{PK(c_1)}$ holds relies on
the validity of the key pair $K = (PK(c_1), SK(PK(c_1)))$, and on the fact that the agent
believes that, if $K$ is valid, then $(\{M\}_{SK(PK(c_1))})_{PK(c_1)} = M$ for any message $M$.
Therefore, these axioms are also needed.

In general, verifying the validity of a required certificate involves obtaining
and verifying the certificates from a trusted certificate to the target certificate.
Obtaining the certificates is referred to as certificate path development, and
checking the validity of the certification path is referred to as certification path
validation. A certification path is usually defined to be a non-empty sequence of
certificates $(C_0, \ldots, C_n)$, where $C_0$ is a trusted certificate, $C_n$ is the target
certificate, and for all $l$ ($0 \le l < n$) the subject of $C_l$ is the issuer of $C_{l+1}$. Once a
certification path $(C_0, \ldots, C_n)$ has been developed for agent $A_i$ to verify
certificate $C_n$, then from the fact that $C_0$ is a certificate trusted by $A_i$ we can assert
that

$B_{A_i}\,Valid(C_0)$.

That is, $A_i$ believes that $C_0$ is valid. Path validation involves checking
whether $A_i$’s trust in $C_0$ can be transferred to its trust in $C_n$, i.e., it requires
proving the following trust transfers:

$B_{A_i}\,Valid(C_0) \vdash B_{A_i}\,Valid(C_1)$,

$B_{A_i}\,Valid(C_1) \vdash B_{A_i}\,Valid(C_2)$,

$\vdots$

$B_{A_i}\,Valid(C_{n-1}) \vdash B_{A_i}\,Valid(C_n)$.

Unless the proofs of all these trust transfers are successfully completed, the agent
$A_i$ cannot accept $C_n$ as valid by this path.
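The chain of trust transfers above amounts to a simple loop over adjacent certificates. The sketch below is illustrative only: `validate_path` and its `step_ok` callback are hypothetical names, and `step_ok(prev, cur)` merely stands in for one completed proof of the form "belief in the validity of `prev` yields belief in the validity of `cur`" (for example, a check of the premises of T5 and T6).

```python
from typing import Callable, Sequence, TypeVar

C = TypeVar("C")


def validate_path(path: Sequence[C], step_ok: Callable[[C, C], bool]) -> bool:
    """Return True if trust in path[0] transfers, link by link, to path[-1].

    path[0] is assumed to be the trusted certificate C_0 and path[-1] the
    target certificate C_n.
    """
    if not path:
        raise ValueError("a certification path must be non-empty")
    for prev, cur in zip(path, path[1:]):
        if not step_ok(prev, cur):
            return False  # one failed transfer breaks the whole path
    return True
```

A path consisting only of the trusted certificate is accepted trivially, which matches the reading that no transfer needs to be proved in that case.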

PKIs provide a mechanism for agents to transfer their trust from where it 
exists to where it is needed, while our logic allows agents to check the correctness 
of such trust transfer based on the trust theory. However, we have to note that 
PKIs do not create trust [5]. Any PKI is only able to propagate it: agents must 
initially trust something. Usually, initial trust is established off-line. In our ap- 
proach, initial trust can also be formalized as proper axioms in the trust theory 
of the PKI. Once the set of trust axioms for a given PKI is given, agents can 
obtain their trust bases as well as the initial trusted certificate set. For a detailed
discussion of this, we refer the reader to Liu et al. [7].

6 Conclusion 

We have presented a typed modal logic that can be used for describing and 
reasoning about trust in secure digital communication. The modal logic is sound 
and complete and, based on this logic, a trust theory for a given system can 
be established. Thus, from agents’ initial beliefs, trust can be transferred from 
where the trust exists initially to somewhere else where it may be needed, and 
the correctness of the transfer process can be formally proved. Our approach is
flexible: it not only applies to a range of applications, such as analysing and
designing authentication protocols, but can also be easily modified by deleting
or adding trust axioms for any specific purpose. The examples given in the paper
also show that the proof process based on a trust theory can be carried out
automatically once we have mechanized our logic and trust axioms in a suitable
prover, such as Isabelle [8].

As we pointed out before, a number of logics have been developed for
specifying and reasoning about agents’ beliefs. In particular, the BAN logic
family [1,2,4,10] has been widely discussed and applied to the analysis of
identification/authentication protocols, particularly authenticated key distribution
protocols. BAN logic is a many-sorted modal logic. It includes several sorts of
objects: principals, encryption keys, and statements (formulae). Our typed logic
is close to such a logic, but we have separated the trust axioms from the logic,
which is regarded as the basis for building the theory of trust. The advantage
of our approach is that it makes the logic more flexible. Rangan [9] treated a theory
of trust as consisting of a logic and a set of trust axioms, but did not consider
types.

Future work may include completing the theoretical study of the logic TML, 
and investigating techniques for reasoning about trust in a well-constructed the- 
ory. We may consider different distributions of trust points within a specific 
system and continue to investigate trust models. Mechanizing trust theories is 
also planned. 
Abstract. Biological systems have often provided inspiration for the
design of artificial systems. One such example of a natural system that
has inspired researchers is the ant colony. In this paper an algorithm
for multi-agent reinforcement learning, a modified Q-learning, is
proposed. The algorithm is inspired by the natural behaviour of ants, which
deposit pheromones in the environment to communicate. The benefit,
besides simulating ant behaviour in a colony, is in the design of complex
multi-agent systems. Complex behaviour can emerge from relatively simple
interacting agents. The proposed Q-learning update equation includes a
belief factor. The belief factor reflects the confidence the agent has in the
pheromone detected in its environment. Agents communicate implicitly
to co-operate in learning to solve a path-planning problem. The results
indicate that combining synthetic pheromone with standard Q-learning
speeds up the learning process. It will be shown that the agents can be
biased towards a preferred solution by adjusting the pheromone deposit
and evaporation rates.



Keywords: Machine Learning, Reinforcement Learning, Multi-agent system 

1 Introduction 

An ant colony displays collective problem-solving ability [4, 9]. Complex
behavioural patterns emerge from the interaction of relatively simple behaviours
of individuals, a characteristic that artificial multi-agent systems seek to
reproduce. The ant colony exhibits, among other features, co-operation and
co-ordination, and its members communicate implicitly by depositing pheromones.
A foraging ant will deposit a trail of pheromones. The problem is that of learning the
shortest path between nest and food whilst minimising effort. The aim of the
work described in this paper is to develop an algorithm for multi-agent learning
inspired by the search strategies of foraging ants, using synthetic pheromones. In
particular we use Q-learning augmented with a belief factor. The belief factor
is a function of the pheromone concentration on the trail and reflects the extent
to which an agent will take into consideration the information laid down by all
agents within the environment. Reinforcement learning and synthetic pheromone
have been combined for action selection strategies [15, 20]. The usefulness of the
belief factor is that it allows an agent to make selective use of communication
from other agents where the information may not be reliable due to changes in
the environment, an important issue when designing intelligent agents for the
real world.

Section 2 presents related work in ant behaviour modelling and ant systems
that have applied ant foraging mechanisms to optimisation problems. Section 3
describes the natural behaviour of ants in a colony. Section 4 discusses
reinforcement learning, specifically Q-learning, followed in Section 5 by the
pheromone-Q learning update equation. Experiments and results obtained with this
algorithm are described in Sections 6 and 7 respectively and discussed in Section 8.
Section 9 gives some indication of future work and finally the paper concludes in
Section 10.



2 Related Work 



The work described in this paper is inspired by ant foraging mechanisms. The
aim is to produce useful problem-solving behaviours from relatively simple
behaviours in software agents. In common with all works described in this section,
it uses synthetic pheromones for communication in a multi-agent environment.
The agents can detect pheromone deposited on the agent trails. Ant behaviour
has been researched both to understand ant colony behaviour and
to develop intelligent systems.

Ollason, in [16,17], reports a deterministic mathematical model for feeding 
ants. The model predicts the behaviour of ants moving from one regenerating 
food source to the next. Anderson [1] extends Ollason’s work to simulate a colony 
of ants feeding from a number of regenerating food sources. 

Though not intended for ant behaviour modelling or simulation, a method- 
ology inspired by the ant behaviour was developed in [7, 11, 13]. While foraging 
for food, certain ant species find the shortest path between a food source and 
the nest [2]. Some of the mechanisms adopted by foraging ants have been ap- 
plied to classical NP-hard combinatorial optimisation problems with success. In 
[10] Ant Colony Optimisation is used to solve the travelling salesman problem, 
a quadratic assignment problem in [13], the job-shop scheduling problem in [6], 
and the Missionaries and Cannibals problem in [18]. 

In [12] Gambardella suggests a connection between the ant optimisation al- 
gorithm and reinforcement learning (RL) and proposes a family of algorithms 
(Ant-Q) related to the RL Q-learning. The ant optimisation algorithm is a spe- 
cial case of the Ant-Q family. In these works, synthetic pheromone is used in 
the action selection strategy whereas in the work presented in this paper, the 
pheromone detected is integrated into the update equation. 

The merging of ant foraging mechanisms and reinforcement learning is also
described in [15]. Three mechanisms found in ant trail formation were used as
exploration strategies in a robot navigation task. In this work, as with the Ant-Q
algorithm, the information provided by the pheromone is used for the action
selection mechanism.

Another work inspired by ant behaviour is reported in [20]. It is applied to
a multi-robotic environment where robots transport objects between locations.
Rather than physically laying a trail of synthetic pheromones, the robots
communicate path information via a shared memory.



3 Ant Behaviour 

Ants are able to find the shortest path between a nest and a food source by
an autocatalytic process [3, 14]. This process comes about because ants deposit
pheromones on the trail as they move along in search of food or of resources to
construct a nest. The pheromone evaporates with time; nevertheless, ants follow
a pheromone trail and, at a branching point, prefer the path with the
highest concentration of pheromone. On finding the food source, the ants return
laden to the nest, depositing more pheromone along the way and thus reinforcing
the pheromone trail. Ants that have followed the shortest path return to the nest
more quickly, reinforcing the pheromone trail at a faster rate than ants that followed
an alternative, longer route. Further ants arriving at the branching point choose
the path with the highest concentration of pheromone, reinforcing the pheromone
even further, and eventually most ants follow the shortest path.
The amount of pheromone secreted is a function of the angle between the path
and a line joining the food and nest locations [5]. Deneubourg [8] found that
some ants make U-turns after a branch, and that a greater number make a
U-turn to return to the nest or to follow the shorter path after initially selecting
the longer path. This U-turn process reinforces the aggregation of pheromone
on the shortest path.

So far two properties of pheromone secretion have been mentioned: aggregation
and evaporation [19]. Concentrations add when ants deposit pheromone at
the same location, and over time evaporation causes a gradual reduction in
pheromone concentration. A third property is diffusion [19]: the pheromone at
a location diffuses into neighbouring locations.

4 Reinforcement Learning 

Reinforcement Learning (RL) is a machine learning technique whereby an agent
learns by trial and error which action to perform by interacting with the
environment. Models of the agent or environment are not required. At each discrete
time step, the agent selects an action given the current state and executes the
action, causing the environment to move to the next state. The agent receives
a reward that reflects the value of the action taken. The objective of the agent
is to maximise the sum of rewards received when starting from an initial state
and ending in a goal state. One form of RL is Q-learning [21]. The objective
in Q-learning is to generate Q-values (quality values) for each state-action pair.
At each time step, the agent observes the state $s_t$ and takes action $a$. It then
receives a reward $r$ dependent on the new state $s_{t+1}$. The reward may be
discounted into the future, meaning that rewards received $n$ time steps into the
future are worth less by a factor $\gamma^n$ than rewards received in the present. Thus
the cumulative discounted reward is given by (1)

$R = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^n r_{t+n}$    (1)

where $0 \le \gamma < 1$. The Q-value is updated at each step using the update equation
(2) for a non-deterministic Markov Decision Process (MDP)

$Q_n(s_t, a) \leftarrow (1 - \alpha_n)\,Q_{n-1}(s_t, a) + \alpha_n \left( r_t + \gamma \cdot \max_{a'} Q_{n-1}(s_{t+1}, a') \right)$    (2)

where $\alpha_n = \frac{1}{1 + visits_n(s_t, a)}$. Q-learning can be implemented using a look-up table
to store the values of Q for a relatively small state space. Neural networks are
also used for Q-function approximation.
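As an illustration of update equation (2) with a look-up table, the following minimal sketch may help. It assumes the learning rate decays with the number of visits to each state-action pair, as is common for non-deterministic MDPs; the function name `q_update`, the dictionary-based tables and the default `gamma` are illustrative choices, not taken from the paper.

```python
from collections import defaultdict


def q_update(Q, visits, s, a, r, s_next, actions, gamma=0.9):
    """One tabular Q-learning step for a non-deterministic MDP.

    Q and visits are dictionaries keyed by (state, action); missing
    entries default to zero.
    """
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])  # assumed visit-based learning rate
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
    return Q[(s, a)]


# Usage: both tables start empty and default to zero.
Q = defaultdict(float)
visits = defaultdict(int)
q_update(Q, visits, s=0, a="E", r=1.0, s_next=1, actions=["E", "W"])
```

Because the learning rate shrinks as a pair is revisited, later noisy rewards perturb the stored value less than early ones.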



5 The Pheromone-Q (Phe-Q) Learning 



The main difference between the Q-learning update equation and the
pheromone-Q update equation is the introduction of a belief factor that must also be
maximised. The belief factor is a function of synthetic pheromone. The synthetic
pheromone $\Phi(s)$ is a scalar value, where $s$ is a state (a cell in the grid world),
that comprises three components: aggregation, evaporation and diffusion. The
pheromone $\Phi(s)$ has two possible discrete values: one for the pheromone
deposited when searching for food and one for the pheromone deposited when
returning to the nest with the food.

The belief factor ($B$) dictates the extent to which an agent believes in the
pheromone that it detects. During early training episodes, an agent will believe
to a lesser degree in the pheromone map because all agents are biased towards
exploration. The belief factor is given by (3)



$B(s_{t+1}, a) = \dfrac{\Phi(s_{t+1})}{\sum_{\sigma \in N_a} \Phi(\sigma)}$    (3)

where $\Phi(s)$ is the pheromone concentration in a cell/state, $s$, on the grid and $N_a$
is the set of neighbouring cells.

The Q-learning update equation modified to take into account the synthetic
pheromone is given by (4)

$Q_n(s_t, a) \leftarrow (1 - \alpha_n)\,Q_{n-1}(s_t, a) + \alpha_n \left( r_t + \gamma \cdot \max_{a'} \left( Q_{n-1}(s_{t+1}, a') + \xi B(s_{t+1}, a') \right) \right)$    (4)

where the parameter, $\xi$, is a sigmoid function of time (epochs $\ge 0$). The value
of $\xi$ increases as the number of agents that successfully accomplish the task grows.
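A sketch of the modified update may make the role of the belief factor concrete. Everything specific here is an assumption made for illustration: the belief is taken to be the pheromone in the cell an action leads to, normalised by the pheromone in the four neighbouring cells; the sigmoid weight depends on the epoch number with an arbitrary midpoint and slope; and the names `belief`, `xi` and `phe_q_update` are hypothetical.

```python
import math
from collections import defaultdict

MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}


def step(cell, action):
    """Cell reached from `cell` by `action` (no bounds check in this sketch)."""
    dr, dc = MOVES[action]
    return (cell[0] + dr, cell[1] + dc)


def belief(phi, cell, action):
    """Assumed belief factor: pheromone in the cell reached by `action`,
    normalised by the total pheromone in the four neighbouring cells."""
    total = sum(phi[step(cell, a)] for a in MOVES)
    return phi[step(cell, action)] / total if total > 0 else 0.0


def xi(epoch, midpoint=50.0, slope=0.1):
    """Sigmoid weight on the belief; midpoint and slope are illustrative."""
    return 1.0 / (1.0 + math.exp(-slope * (epoch - midpoint)))


def phe_q_update(Q, visits, phi, s, a, r, s_next, epoch, gamma=0.9):
    """One Phe-Q step: the Q-learning target plus a weighted belief term."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])
    target = max(Q[(s_next, a2)] + xi(epoch) * belief(phi, s_next, a2)
                 for a2 in MOVES)
    Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * (r + gamma * target)
    return Q[(s, a)]
```

With no pheromone anywhere the belief term is zero and the step reduces to the standard Q-learning update, which is one way to sanity-check an implementation.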



6 Methodology 

It will be shown that the modified update equation converges. Speed of
convergence determines how fast an agent learns. The objective of the experiments
is to evaluate the modified update equation for Phe-Q and to confirm convergence
empirically. Phe-Q will be compared to standard Q-learning for speed of
convergence.

For the experiments reported, the agent environment is an $N \times N$ grid, where
$N = 10, 20, 40$. Each cell has an associated pheromone strength (a scalar value).
The agents are placed at a starting cell (the nest) on the grid. The aim is for the
agents to locate ‘food’ occupying one or more cells throughout the grid space
and return to the starting cell. The agents move from cell to cell, depositing
discrete quantities of pheromone in each cell. There are two pheromone values,
one associated with the search for a food location (outbound pheromone) and
the other associated with the return to the nest (return pheromone). The values
for the outbound and return pheromone concentrations were adjusted manually
to optimise the Phe-Q agent’s search performance. This is discussed further
in Section 8. The pheromone adds linearly (aggregates) in a cell up to an
upper bound, and evaporates at a rate (the evaporation rate $\varphi_a$) until none
remains, unless the cell pheromone is replenished. Each agent has a set of tasks
to accomplish, and each task has an associated Q-table. The first task is to reach the
‘food’ location, and the second task is to return to the nest laden with food.

More than one agent can occupy a cell within the $N \times N$ grid. The pheromone
strength at a location is $\Phi \in [0, 100]$. Pheromone is de-coupled from the state at
the implementation level so that the size of the state space is $N \times N$; a single
cell corresponds to a state. For a small grid, e.g. $N \le 40$, a lookup table is used
for maintaining the Q-values.
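The pheromone map described above, kept separate from the Q-tables, can be sketched as follows. This is an illustrative reading only: the class name `PheromoneMap` and its methods are hypothetical, evaporation is assumed to subtract a fixed amount per cell per time step, and the upper bound of 100 is taken from the stated range of the pheromone strength.

```python
class PheromoneMap:
    """N x N grid of pheromone strengths, de-coupled from the agent state."""

    def __init__(self, n, evaporation=1.0, upper=100.0):
        self.n = n
        self.evaporation = evaporation  # units removed per cell per time step
        self.upper = upper              # cap on the concentration in a cell
        self.phi = [[0.0] * n for _ in range(n)]

    def deposit(self, row, col, amount):
        """Aggregation: deposits add linearly, saturating at the upper bound."""
        self.phi[row][col] = min(self.upper, self.phi[row][col] + amount)

    def evaporate(self):
        """Evaporation: each cell loses a fixed amount, never dropping below zero."""
        for row in self.phi:
            for col in range(self.n):
                row[col] = max(0.0, row[col] - self.evaporation)


# Usage: an agent deposits outbound pheromone, then one time step passes.
grid = PheromoneMap(10)
grid.deposit(0, 0, 1.0)
grid.evaporate()
```

Keeping the map outside the state means the Q-tables stay at one entry per cell and action, regardless of how many pheromone levels are possible.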

The agent receives a reward on completing the tasks i.e. when it locates the 
food and when it returns to the nest. Each experiment consists of a number 
of agents released into the environment and running in parallel for 500 to 1000 
epochs. Each epoch is the time from the agent’s release from the nest to the 
agent’s return to the nest. 

In the experiments, the search is achieved with pheromone aggregation and 
evaporation. Diffusion has not yet been implemented. The outbound pheromone 
strength was varied between 0.5 and 1.5 units, and the return pheromone strength 
was varied between 5.0 and 40.0 units. While returning to the nest the agents 
do not make ’use’ of pheromone for guidance. The experiment was run with and 
without obstacles in the grid space. An agent cannot occupy the same cell as an 
obstacle. It must navigate around the obstacle. 

Table 1. Pheromone variables: fast convergence

Pheromone   | Phe(food) | Phe(nest) | Phe(evaporation)
Phe-Q agent | 1.0       | 10.0      | 1.0
Table 2. Pheromone variables: location search bias

Pheromone   | Phe-food | Phe-nest | Phe-evaporation
Phe-Q agent | 1.5      | 30.0     | 0.3



7 Results 

To demonstrate convergence of the Phe-Q update equation empirically, the RMS
of the error between successive Q-values is plotted against epoch (an epoch is
a complete cycle of locating food and returning to the nest). The RMS curve
for Phe-Q (averaged over 10 agents) in Figure 1 shows convergence. For
comparison, the Q-learning RMS curve is shown in the same graph. Phe-Q
learning converges faster than Q-learning. This particular experiment was run
with a number of obstacles scattered throughout the grid. With fewer obstacles,
Phe-Q converges at a faster rate and the difference between Phe-Q learning and
Q-learning is greater. For a given grid size, there is a limit to the number of agents
for which Phe-Q performs better than Q-learning, with or without obstacles. In
a 20 x 20 grid space, the performance of the Phe-Q agent degrades to that of
Q-learning with approximately 30 agents. The graph in Figure 2 shows the RMS
curves for an increasing number of Phe-Q agents at a constant grid
size (for clarity only the RMS curves for 5, 40, and 60 agents are shown on
the graph). Between 5 and 20 agents, the speeds of convergence of Phe-Q are
comparable. Above that number, the trend is towards slower convergence, a
phenomenon that does not occur with Q-learning. The reason for this is explained
in the next section.
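The convergence measure used above, the RMS of the change between successive Q-values, could be computed along the following lines. This is a sketch under the assumption that each Q-table is a dictionary keyed by (state, action) and that missing entries count as zero; the function name `rms_q_change` is illustrative.

```python
import math


def rms_q_change(q_prev, q_curr):
    """RMS of the difference between two Q-tables keyed by (state, action)."""
    keys = set(q_prev) | set(q_curr)
    if not keys:
        return 0.0
    squared = sum((q_curr.get(k, 0.0) - q_prev.get(k, 0.0)) ** 2 for k in keys)
    return math.sqrt(squared / len(keys))
```

Taking a snapshot of the table at the end of each epoch and plotting this value against the epoch number gives a curve that approaches zero as learning converges.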




Fig. 1. RMS curve for Phe-Q learning

Fig. 2. Performance scaling



The graph in Figure 3 shows the search performance of the Q-learning and
the Phe-Q learning agents with two food sources in a 20 x 20 grid. The first food
source was located at the corner diagonally opposite the nest (cell 399, with the
nest at cell 0), and the second food source was located midway, at the centre of
the grid. Phe-Q learning converges faster than Q-learning. The objective of the
experiment was to determine which food location the two types of agents would prefer.

In Figure 4, the number of visits per episode (locate food and return) is
plotted. Results indicate that for a relatively simple ‘world’, e.g. as described
above with one food source located centrally, both types of agents learn to visit
mainly the closest food location, as expected, but the Q-agent visited the closer
food location more frequently than the Phe-Q agent. However, when the closest,
centrally located food source was surrounded by obstacles on three sides (the
unobstructed side was facing away from the nest) and the more distant food source
was unobstructed, the Q-agent visited the hidden food source with similar
frequency as in the previous experiment with no obstacles, whereas the Phe-Q agent
visited the hidden food source less frequently, as shown in Figure 4. It was also
found that the Phe-Q agents converged faster with two or more food sources. These
results were obtained with the outbound and return pheromone values shown in
Table 1. From the table it can be seen that the outbound pheromone is low (1
unit) and the return pheromone is 10 times higher (10 units). The evaporation
rate was set to 1 unit at each discrete time step, which meant that the outbound
pheromone played a minor role. Experimental results showed that when the
pheromone concentration deposited on the return was increased, the Phe-Q agent
performed less well in terms of speed of convergence, degrading to that of the
Q-agent. However, in both cases, as with real ants, the agents preferred the shortest
path by a ratio of between 20:1 and 25:1, even when the path to the closest food was
partially obstructed. An important point to note is that the Phe-Q agent learnt
to avoid the obstacles more frequently and opted for a more distant but easily
located food source.
Fig. 3. Two competing food sources

Fig. 4. Visits to closest but hidden food source



In the above experiments, speed of convergence, i.e. faster learning, was the
main goal and the pheromone variables were selected by experimentation for
that purpose. It was found that by selecting a range of pheromone variables,
the behaviour of the Phe-Q agents could be biased towards a preference for
one or another food source. For example, using the pheromone variables set in
Table 2, the Phe-Q agent preferred the closest food source even if it was partially
obstructed, as shown in Figure 5. Note that with the pheromone variables in Table
1, the Phe-Q agent preferred the distant unobstructed food. In particular, with
three food sources, two of which were hidden, the Phe-Q agents with the variables
in Table 2 were biased towards the closest, most obscured food location, as shown
in Figure 6.




Fig. 5. Two competing food sources

Fig. 6. Three competing food sources, one hidden



The pheromone aggregation and evaporation rates influence
the search patterns, i.e. which item an agent searches for. The pheromone
variables can be chosen to suit a particular application.

The objective of the following experiment was to test the adaptability (and
flexibility) of the Phe-Q agent as compared to the Q-agent. Since the Phe-Q
agent converges faster than the Q-agent, it is expected that it will adapt to
change more quickly than the Q-agent. This is shown in Figure 7. The course is
Y-shaped with a single food source in each branch. The food source in the left
branch is depleted after a number of visits. From the RMS curves for both types
of agents (Figure 7), it is seen that the Phe-Q agent adapts more quickly to the new
situation than the Q-agent.

Fig. 7. Food source depletion after 200 visits

8 Discussion

The synthetic pheromone guides the agents; it is a form of implicit communication.
At the implementation level, a pheromone map is produced. This map is de-coupled
from the grid ‘world’, thus reducing the state space. The information exchange via
pheromone enables the agents to learn faster, as demonstrated by faster
convergence. Phe-Q was compared with Q-learning, in both cases using the greedy and
Boltzmann action selection mechanisms. Phe-Q using Boltzmann was seen to
perform better than all three other combinations. There is however a price to be
paid for this information sharing. Too much information, i.e. too high a pheromone
deposit rate or too low a pheromone evaporation rate, causes, not unexpectedly,
poorer results, especially in the earlier learning stages. Agents are ‘misled’ by
exploring agents. However, it was seen that the agents were not highly sensitive to the
degree of pheromone information belief. In addition, it was expected that an
agent may be ‘deceived’ by its own pheromone, i.e. influenced by pheromone it had
just deposited. It was anticipated that this could lead to cycling. However,
the higher exploration rates in the early learning phases prevent cycling from
becoming a problem.
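For readers unfamiliar with the two action selection mechanisms compared above, a minimal sketch follows. The temperature parameter and the function names are illustrative assumptions; the paper does not state the values or schedule it used.

```python
import math
import random


def boltzmann_probs(q_values, temperature=1.0):
    """Softmax over Q-values; a higher temperature gives more exploration."""
    top = max(q_values)  # subtract the maximum for numerical stability
    weights = [math.exp((q - top) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann_select(actions, q_values, temperature=1.0, rng=random):
    """Sample one action according to the Boltzmann probabilities."""
    probs = boltzmann_probs(q_values, temperature)
    return rng.choices(actions, weights=probs)[0]


def greedy_select(actions, q_values):
    """Pick the action with the largest Q-value (the first one on ties)."""
    best = max(range(len(actions)), key=lambda i: q_values[i])
    return actions[best]
```

A high temperature early in training keeps selection close to uniform, which is consistent with the observation that early exploration prevents an agent from cycling on its own recent deposits.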

Whereas with non-interacting Q-learning agents the convergence speed does
not change with the number of agents, with Phe-Q learning it was seen that there
is an upper limit to the number of agents searching a space while maintaining
faster convergence (with respect to Q-learning). Too high a number of agents
slows down learning (convergence). The pheromone deposited by large numbers
of exploring agents ‘misleads’ agents. In addition, with a high number of agents
the solution also becomes computationally intensive.

The results show that the Phe-Q agent adapts more quickly to changes than the
Q-agent. This is to be expected as it learns faster. This is a useful characteristic
in a dynamic, changing environment.



9 Future Work 

Phe-Q will be compared to other reinforcement learning techniques, specifically
eligibility traces. An advantage of a multi-agent system compared to a single
monolithic agent is the emergence of a more ‘complex’ behaviour. In this
particular case the agents are required to communicate and co-ordinate to solve a
problem. The more complex the problem, the greater the benefits of the
multi-agent solution. It remains to be seen whether the performance of Phe-Q carries
over to different types of problems.

There are several variables to be tuned in order to optimise the problem-solving
capability of the Phe-Q agent, particularly with respect to pheromone
concentrations, dropping rates, evaporation rates, and diffusion rates across cells.
It is intended to reverse engineer the problem in order to find the optimum values.

An issue currently under investigation is that of agent trust. So far the agents
have believed the shared information. The authors are looking into deception
whereby agents use synthetic pheromone information to deceive agents
inhabiting the ‘world’, and deceived agents backtrack. This work will lead to modelling
deception and countering deception.

10 Conclusions 

The work described in this paper set out to investigate the use of synthetic 
pheromone for implicit communication to speed up multi-agent learning. Rather 
than using pheromone information directly for the action selection strategies, 
each agent calculates a belief value for the information based on pheromone 
concentration in the four surrounding cells. The belief is maximised together 
with the Q-value. This technique, Phe-Q learning, was shown to converge faster 
than Q-learning when searching for food at different locations in virtual spaces 
with varying degrees of complexity (obstacles). With two food sources, Q-agents 
had a preference for the closest source almost to the exclusion of the furthest 
food source, irrespective of whether the closer food source was hidden or not, in 
the process taking more time to learn the solution. Phe-Q agents can be biased 
towards a particular food source. 

The Phe-Q agent also showed greater adaptability to changes in the environment.
This is an important characteristic for agents inhabiting a noisy real world.

Abstract. In this paper, we apply algorithms that facilitate learning to kansei
modeling and experimentally investigate the constructed kansei model itself. We
use a vector space as the scheme of mental representation and place still
images in the perceptual space by generating perceptual features. Furthermore,
we propose a method to manipulate the perceptual data by optimizing modeling
parameters based on the kansei scale. After this adaptation, we compare the
similarity between kansei clusters using their distance in the space to
evaluate whether the adapted perceptual space is appropriate for a person’s
kansei. We have conducted preliminary experiments using image data from TV
commercials and briefly evaluated the mental space constructed by our method
through a kansei questionnaire.



1 Introduction 

Humans perceive objects and images not only through sensory perception but also
through judgment based on their memory, experience and preferences [1]. For
example, one person may consider a picture attractive while another may not.
This occurs because people’s viewpoints differ even though they have similar
sensors.

In a research project, we have identified various viewpoints of users and
applied them to multimedia data [2, 3, 4]. We have also constructed a decision
support system for creators of TV commercial films, using data that include
still images and consumers’ reports, to evaluate our method. Since TV commercial
creators are expected to produce commercials that are attractive to the target
consumers of the product, it is important for them to grasp the target
consumers’ kansei¹ and to make TV commercials that appeal to that kansei. The
first thing TV commercial creators need to do is to identify the factor that
arouses that kansei.

To specify this factor, kansei modeling, which represents subjective
perceptions, has been studied in the human media project [5, 6]. That work
treats subjective interpretation as a ratio of importance assigned to each
physical feature of the image, such as color, direction and position in
relation to stimuli. In previous studies this was realized with a statistical
method [7] or a neural network [8], but, when evaluated as learning algorithms,
these methods still have difficulty predicting subjective interpretation.
Furthermore, no experiments have yet been conducted to analyze the constructed
kansei model itself.

¹ Kansei is a Japanese word that refers to human reactions to various stimuli,
ranging from the sensory to the mental: sensitivity, sense, sensibility,
feeling, esthetics, emotion, affection and intuition.

Regarding the construction of good human perceptual models, artificial
intelligence researchers have seldom studied how to construct a suitable model
for a learning system, despite the excellent results achieved through the
improvement of learning algorithms. Various well-constructed machine learning
algorithms (e.g. ID3 [9], C4.5 [10], Progol [11]) have been proposed and are
widely used. Although these algorithms show sufficient learning ability if the
world is modeled accurately, it is still quite difficult to represent the real
world because of noisy data and irrelevant features. Even if those problems
were solved, personal preferences would be a bigger obstacle. If a machine
constructs the model adaptively, that is, if the machine interprets a person’s
mental model and creates a suitable model of it, the machine can capture that
person’s preferences, judgments or behavior. One approach to achieving this is
constructive induction [12], i.e., the automatic computation of suitable
feature representations for machine learning tasks. On the other hand, we
introduced a representation of instance data in a perceptual vector space by
generating perceptual features, and proposed a method to manipulate the
perceptual data by adapting modeling parameters to the task [2]. The algorithms
that construct a model to facilitate learning, and the experiments confirming
their high performance, were reported in [13].

In this paper, we apply the algorithms that facilitate learning to kansei
modeling and experimentally investigate the constructed kansei model itself. We
represent the TV commercial images in the perceptual space by generating
perceptual features and adapt the perceptual data by optimizing the modeling
parameters based on the kansei scale [14]. After this adaptation, we compare
the similarity between kansei clusters using their distance in the space to
evaluate whether the adapted perceptual space is appropriate for a person’s
kansei. Section 2 describes the representation of TV commercial image data in a
perceptual instance space and how to implement it. Section 3 presents an
algorithm to manipulate the perceptual data by adapting modeling parameters for
the task. Preliminary experiments utilizing real-world data and a brief
evaluation are presented in Section 4. Section 5 identifies future directions
for this work and presents conclusions.



2 Perceptual Modeling of TV Commercial Data 

2.1 Perceptual Instance Space 

We propose to use a vector space as a scheme of the mental representation. A
TV commercial (CM) image perceived by a person corresponds to a point in the
vector space. This approach has the following advantages.

- Similarity between situations in the software’s mental space can be easily
  measured by means of the Euclidean distance between corresponding points in
  the vector space.

- It is easy to manipulate the points by something other than the input
  stimuli, if it is related to the axes composing the vector space.

- A set of points in the vector space can be seen as a private concept of the
  software. This view gives the software a simple means to form a new concept,
  i.e., standard clustering techniques.

In this approach, the nature of a perception is characterized by the axes of
the vector space and by how input stimuli are transformed into points in the
space. Ideally the software should be able to construct the axes on demand. In
this paper we simplify the problem by giving the software a reasonably large
space and letting it use subspaces of that space freely.

Regarding the transformation, the behavior of the transformers should be
systematically controllable, because we want to adjust that behavior to
optimize the efficiency of the software at a task, so that the software can
gradually improve its perception through executions of the task. This can be
done by parameterizing the procedures. We consider adaptation of the
transformers to be the more important issue and are investigating the
possibility of achieving it.



2.2 Design of Features of Space 

A human obtains millions of items of information from vision [15], and
simulating all of them is beyond the scope of this study. Apart from the verbal
information contained in a TV commercial, we consider that the factors that
affect human perception mainly involve color and the TV personalities in the
image. From this point of view, we introduce features of index colors and TV
personalities in the TV commercial as perceptual axes of the space. For
numerical data related to the personalities, such as age or popularity, we
apply the following sigmoid function to map the value into the range (0, 1).
Intuitively, the value of the function stands for the representation of the
data in the mental space.

$$f(x) = \frac{1}{1 + \exp(-k(x - x_0))}$$

Here $x_0$ represents the mean of $x$ and $k$ indicates the slope of the curve
at $x = x_0$.

For the index colors, we extract them by means of a modeling program that
involves parameters. We built a model of human perception of the index colors
in the image. The algorithm consists of two steps, which are described in the
following subsections. Both steps involve parameters, which are adjusted to
make the space of index colors useful for a task. We also propose to use a
vector space as a scheme of the internal representation (the mental space) of
the system. This makes it easy to judge the similarity of kansei patterns,
because the similarity can be measured by the Euclidean distance between
corresponding points in the vector space.



2.3 Reducing the Number of Colors

A TV commercial image usually includes a large number of colors. However,
people do not distinguish most of them; index colors are determined by
perceiving similar colors as one color and thereby reducing their number [15].
In order to work with reasonable efficiency, the system should set an
appropriate color resolution.

Although there are various methods to extract index colors, such as the
Popularity algorithm or the MC algorithm [16], we adopt the following algorithm
[17] to avoid losing minor colors that are perceptually important.

1. Representation of color space

All colors for every pixel in an image are input in RGB color space. These
colors are converted to L*a*b* color space, where the distance between two
colors reflects human intuition.

2. Making the color list

A color list for an image consists of color space coordinates and their
frequencies, that is, the number of pixels each color occupies. The color list
is constructed by mixing any two colors whose distance is under a threshold p,
taking the frequencies of the colors into account. After this procedure, the
number of colors registered in the list is reduced to 1% of the number of input
colors.

3. Noise removal and color reduction

Since infrequent colors in the list are considered to be noise, a color whose
frequency is under 0.5% of all pixels in the image is mixed with its nearest
neighbor color in the list in the way described above. Furthermore, the two
nearest neighbor colors in the list are mixed in the same way until the number
of colors registered in the list becomes m.

4. Color reduction by histogram

A histogram for each of the axes L*, a* and b* is created by partitioning the
axis into 2m regions and counting the number of colors in each region. The two
nearest neighbor colors in the list are mixed in the same way until the number
of registered colors in the list equals the maximum number of convexities over
the histograms, because the number of convexities is considered to be the
number of distinctive colors in an image intended by a designer [17]. Then the
histograms are recreated and the maximum number of convexities is recalculated.
The procedure is iterated until the maximum number of convexities remains
constant through the recalculation.
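
Steps 2 and 3 above can be sketched as follows. This is only a minimal
illustration under stated assumptions, not the implementation of [17]: colors
are assumed to be given already as L*a*b* triples, distance is plain Euclidean
distance, and the function names (`mix`, `build_color_list`, `remove_noise`) and
the greedy merging order are our own choices.

```python
import math

def mix(c1, f1, c2, f2):
    """Mix two colors, weighting each by its frequency (pixel count)."""
    total = f1 + f2
    mixed = tuple((a * f1 + b * f2) / total for a, b in zip(c1, c2))
    return mixed, total

def build_color_list(pixels, p):
    """Step 2: merge each pixel color into the first listed color closer than p."""
    colors = []  # entries are [color, frequency]
    for px in pixels:
        for entry in colors:
            if math.dist(entry[0], px) < p:
                entry[0], entry[1] = mix(entry[0], entry[1], px, 1)
                break
        else:
            colors.append([tuple(px), 1])
    return colors

def remove_noise(colors, n_pixels, min_share=0.005):
    """Step 3, first half: fold colors below 0.5% of the pixels into their nearest neighbor."""
    colors = [list(entry) for entry in colors]
    while len(colors) > 1:
        rare = min(colors, key=lambda entry: entry[1])
        if rare[1] >= min_share * n_pixels:
            break
        colors.remove(rare)
        nearest = min(colors, key=lambda entry: math.dist(entry[0], rare[0]))
        nearest[0], nearest[1] = mix(nearest[0], nearest[1], rare[0], rare[1])
    return colors
```

The reduction to m colors and the histogram-based step 4 would repeat the same
`mix` operation on the two nearest colors until the stopping condition holds.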



2.4 Conspicuousness of Colors 

The index colors must be conspicuous. The conspicuousness L of a color is
defined as follows.

$$L = \sqrt{(e \cdot C_1)^2 + (f \cdot C_2)^2 + (g \cdot C_3)^2}$$

where $C_1$ denotes the frequency of the color, $C_2$ denotes the temptation of
the color and $C_3$ denotes the contrast of the color, and e, f and g are
parameters. $C_1$, $C_2$ and $C_3$ are defined as follows.

- Frequency

$$C_1 = \frac{\text{frequency of the color}}{\text{number of pixels in the image}}$$

- Temptation

First the color is converted into HSV (Hue, Saturation, Value) color space.
Then

$$C_2 = \frac{s \cdot H + t \cdot S + u \cdot V}{s + t + u}$$

where H here denotes the affection value of the hue (predefined for all colors)
and s, t and u are parameters.

- Contrast

Let $V'$ denote the V value of the color created by mixing all the colors in
the list except the color under consideration. Then

$$C_3 = |V - V'|$$



2.5 Specifying Index Colors 

L is calculated for all the colors in the list. The index colors are the n
colors with the largest L values in the list. Let us emphasize that we have
introduced controllable parameters p, e, f, g, s, t, u and m for flexible
modeling of images.
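
As an illustration of Sections 2.4 and 2.5, the sketch below computes the
conspicuousness of each color in a list and keeps the n most conspicuous ones.
Several points are assumptions rather than part of the method as published:
colors are passed in directly as HSV values with their frequencies, `affection`
is a hypothetical lookup for the predefined affection value of a hue, and the
mixed color used for the contrast term is approximated by the
frequency-weighted mean V of the other colors.

```python
import math

def conspicuousness(c1, c2, c3, e, f, g):
    """L = sqrt((e*C1)^2 + (f*C2)^2 + (g*C3)^2)."""
    return math.sqrt((e * c1) ** 2 + (f * c2) ** 2 + (g * c3) ** 2)

def index_colors(colors, n_pixels, n, e, f, g, s, t, u, affection):
    """Return the n most conspicuous colors.

    colors: list of (hue, saturation, value, frequency) tuples in HSV.
    affection: hypothetical callable giving the predefined affection value of a hue.
    """
    total_freq = sum(freq for _, _, _, freq in colors)
    weighted_v = sum(val * freq for _, _, val, freq in colors)
    scored = []
    for color in colors:
        hue, sat, val, freq = color
        c1 = freq / n_pixels                                        # frequency
        c2 = (s * affection(hue) + t * sat + u * val) / (s + t + u)  # temptation
        rest = total_freq - freq
        # Assumption: V' is the frequency-weighted mean V of all other colors.
        v_rest = (weighted_v - val * freq) / rest if rest else val
        c3 = abs(val - v_rest)                                      # contrast
        scored.append((conspicuousness(c1, c2, c3, e, f, g), color))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [color for _, color in scored[:n]]
```

Because e, f, g, s, t and u are ordinary arguments, an optimizer can vary them
without touching the extraction code, which is the point of parameterizing the
model.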

3 Adaptive Mental Space 

3.1 Previous Works 

In machine learning approaches, it is widely considered that only a domain ex- 
pert who has great knowledge regarding how to solve the problems can design 
appropriate input information for a learning system. In other words, good mod- 
eling leads to sufficient learning performance. 

In previous studies, various modeling approaches were applied in order to
obtain sufficient learner performance. One is feature selection, which reduces
the number of features by selecting a subset of the existing features [18].
Algorithms developed to select a subset include heuristics [19], exhaustive
search [20] and the Relief algorithm [21]. Among them, the Relief algorithm
performs best since, unlike exhaustive search, it does not involve expensive
computation and, unlike heuristics, it does not suffer from poverty of concept
description. However, feature selection approaches assume that an initial
feature set is given, and they will only be successful if this initial set is a
suitable starting point for selection. In addition, insufficient training
instances fool Relief, and it is important for Relief to pick real near-hit
instances.

The other modeling approach is feature construction, which adds relationships
between features by generating new features as combinations of existing
features [18]. By generating good new features, the number of peaks of the
target concept is reduced. A concept with few peaks is learned more easily than
a concept that is spread out all over the instance space. It is reported that
concept concentration can affect learning drastically [22]. However, how to
decrease the number of peaks of the target concept has not been studied.

From the kansei modeling point of view, representing subjective perceptions
is said to be necessary because each person has his or her own viewpoint. One
approach is to construct a model that realizes a human-adaptive system dealing
with subjective interpretation based on visual perception [8]. In that
approach, a visual perception model is composed of a physical level and a
mental feeling level, and subjective interpretations are represented by a ratio
for each feature at the physical level, such as color, direction and position
in relation to stimuli. These ratios are calculated by statistics [7] or a
neural network [8]. That approach is related to our method in the sense that it
specifies the relation between physical features and kansei patterns. However,
when evaluated as learning algorithms in their experiments, these methods still
have difficulty predicting subjective interpretation. Furthermore, no
experiments have yet been conducted to analyze the constructed kansei model
itself.



3.2 Smoothness-Driven Adaptive Model (SAM)

We consider the case in which a perceptual evaluation of data, such as a
subjective evaluation of a TV commercial, is given as a label with a scale. Our
TV commercial data include images in advertisements and consumers’ reports. We
define kansei as a discriminated subjective interpretation, which can be
categorized into groups of adjectives, for our environments. The semantic
differential (SD) method [23] provides us with the discriminated
interpretations for our environments, such as preference or openness, and they
can be represented as groups of paired adjectives such as like-dislike or
simple-complicated.

In the following, $r_i$ ($i = 1, \dots, n$) denotes raw data and $F_j(r_i)$
denotes the perceptual representation of each raw datum based on the j-th
feature.

With the perceptual evaluation, the program reconstructs the space according
to the scale of the evaluation. First, let $E_1 > E_2 > \cdots > E_n$ be the
evaluation values for each $F_j(r_i)$. Then $F_j$ is adjusted to make the
curved surface of E in the space it generates as smooth as possible. We
introduce the following formula to measure the smoothness.

$$d_{j,i} = \frac{F_j(r_{i+1}) - F_j(r_i)}{E_{i+1} - E_i}$$



When $F_j(r_i)$ increases or decreases smoothly in terms of E, the formula
returns $d_{j,i}$ close to a constant. The constant can be anything but 0,
because with d equal to 0 the inputs are indistinguishable. Let the constant be
1. We define the following function $G(d_{j,i})$ to evaluate the smoothness of
the curved surface. If d is near 0,

$$G(d) = (d + 1 - p)^2 + \frac{-A(d + 1 - p)}{1 + e^{d + B + 1 - p}}$$

Otherwise,

$$G(d) = (d - p)^2 + \frac{-A(d - p)}{1 + e^{d + B - p}}$$

A and B define the shape of G. In the following experiments, we use A = 20
and B = 2. p is a constant set so that G takes its smallest value at d = 1 in
the equation above. The closer d is to 1, the better the evaluation value.
G(x) also increases gently for x > 1 and exponentially with the absolute value
of x for x < 1.

With G, the evaluation function for $F_j$ is defined as follows. The lower the
value of $H_j$, the smoother the space is.

$$H_j = \sum_{i=1}^{n-1} G(d_{j,i})$$
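
The smoothness evaluation can be sketched in a few lines. This is an
illustration under assumptions, not the authors' code: only the general
("otherwise") branch of G is implemented, because the text does not say how
close to 0 a value must be for the other branch to apply; p is found by a
simple grid search over an interval we chose ourselves; and the names
`g_shape`, `find_p`, `slopes` and `H` are ours.

```python
import math

A, B = 20.0, 2.0  # values used in the experiments

def g_shape(u):
    """General branch of G written in terms of u = d - p."""
    return u ** 2 + (-A * u) / (1.0 + math.exp(min(u + B, 700.0)))

def find_p(lo=-5.0, hi=5.0, steps=100_000):
    """Grid search for p such that G(d) is smallest at d = 1, i.e. p = 1 - argmin g_shape."""
    grid = (lo + (hi - lo) * i / steps for i in range(steps + 1))
    return 1.0 - min(grid, key=g_shape)

P = find_p()

def G(d):
    """Smoothness penalty for one slope value d (general branch only)."""
    return g_shape(d - P)

def slopes(features, evaluations):
    """d_{j,i} for one feature j; both lists are ordered by decreasing evaluation."""
    return [(features[i + 1] - features[i]) / (evaluations[i + 1] - evaluations[i])
            for i in range(len(features) - 1)]

def H(features, evaluations):
    """Sum of G over consecutive pairs; lower means a smoother surface."""
    return sum(G(d) for d in slopes(features, evaluations))
```

A feature whose values change in step with the evaluation scale gives slopes
of 1 everywhere and therefore the lowest possible H for that number of data.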



4 Empirical Results 

4.1 Experimental Preparation 

We introduce the kansei scale [14] as the label with scale E. A kansei scale
represents a human’s intuitive description of the images, expressed by paired
adjectives. In this experiment a questionnaire was prepared for 100
representative samples using the 14 commonly used kansei scales listed in
Table 1. Sixty men in their twenties answered the questionnaire, which means
that the experimental results are based on the kansei of men in their twenties.

Table 1. Paired adjectives.

Bright - Dark        Man - Woman
Warm - Cold          Artificial - Natural
New - Old            Loud - Quiet
Vivid - Dull         Simple - Complicated
Hard - Soft          International - Japanese
Rustic - Urban       Artistic - Scientific
Beautiful - Ugly     Open - Closed



[Figure: questionnaire scale for the Bright - Dark pair; the scale points carry
the data values 3, 2, 1, 0, -1, -2, -3.]

Fig. 1. Scaling for paired adjective.

Values from -3 to 3 were allocated to the points on the questionnaire’s
horizontal scale. Figure 1 shows the bright - dark kansei scale, one of the
kansei scales, with its evaluation range (3, -3): a larger positive number
indicates a stronger relationship toward bright, whereas a negative number of
larger magnitude indicates a stronger relationship toward dark. We define the
set of TV commercial images given a value over 1.5 on a kansei scale by the
subjects as the positive cluster for that kansei, whereas the set of images
given a value under -1.5 is defined as the negative cluster for that kansei.
For example, on the bright - dark kansei scale, TV commercial images with a
value over 1.5 are labeled bright and those with a value under -1.5 are labeled
dark.
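
The cluster definition above amounts to thresholding the score of each image.
A minimal sketch follows; it assumes each image already has a single score per
kansei scale (for instance a mean over the subjects, which the text does not
specify), and the function name is ours.

```python
def kansei_clusters(scores, threshold=1.5):
    """Split images into the positive and negative cluster of one kansei scale.

    scores: dict mapping an image id to its score on the scale, in [-3, 3].
    """
    positive = {image for image, score in scores.items() if score > threshold}
    negative = {image for image, score in scores.items() if score < -threshold}
    return positive, negative
```

For the bright - dark scale, the first returned set would be the bright
cluster and the second the dark cluster; images scoring between -1.5 and 1.5
belong to neither.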

When we apply the kansei scale as the label with scale E and reconstruct the
space according to E to evaluate SAM, we adopt a genetic algorithm (GA),
specifically simplex crossover for a real-coded GA, to optimize the modeling
parameters, following [24]. According to [24], a parameter set is represented
as a chromosome in the real-coded GA, and the alternation of generations
preserves the distribution of the population. It is also shown there that the
performance is better than that obtained with bit-string coding.

The specification of the GA in the experiment is as follows.

- A parameter set is represented as a chromosome.

- An initial population is generated uniformly within a certain range.

- Mutation is not implemented.

- The alternation of generations is performed by

  1. randomly selecting n individuals from the population as a parent set,

  2. generating a child individual by crossover of the parent set,

  3. randomly selecting 2 parents from the parent set,

  4. replacing the 2 parents with the individual that returns the best fitness
     value (G(x)) and one individual selected from among the 2 parents and the
     child.
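
One generation step of the scheme listed above might look like the sketch
below. It is a simplified stand-in rather than the operator of [24]: the
crossover samples a point uniformly inside the simplex spanned by the parents
(with an optional expansion about the centroid), the second survivor is chosen
uniformly at random from the remaining family members, and lower fitness is
treated as better. All function names are hypothetical.

```python
import random

def crossover(parents, rng, expansion=1.0):
    """Sample a child inside the simplex spanned by the parents.

    Simplified stand-in for simplex crossover, not the exact operator of [24].
    """
    dim = len(parents[0])
    centroid = [sum(p[k] for p in parents) / len(parents) for k in range(dim)]
    # Normalised exponential draws give convex weights uniform over the simplex.
    weights = [rng.expovariate(1.0) for _ in parents]
    total = sum(weights)
    point = [sum(w / total * p[k] for w, p in zip(weights, parents)) for k in range(dim)]
    return [centroid[k] + expansion * (point[k] - centroid[k]) for k in range(dim)]

def alternate_generation(population, fitness, n_parents, rng):
    """One generation step following the four items above (lower fitness is better)."""
    parent_idx = rng.sample(range(len(population)), n_parents)   # 1. parent set
    child = crossover([population[i] for i in parent_idx], rng)  # 2. child by crossover
    i, j = rng.sample(parent_idx, 2)                             # 3. two parents
    family = [population[i], population[j], child]
    best = min(family, key=fitness)                              # 4. best of the family ...
    family.remove(best)
    other = rng.choice(family)                                   # ... plus one more (assumed uniform)
    population[i], population[j] = best, other
```

Keeping the best family member means the best fitness in the population never
gets worse from one step to the next, even though no mutation is used.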



4.2 Similar Kansei Patterns 

In this experiment, we first place 110 TV commercial images in the perceptual
space by generating perceptual features and adapt the perceptual space by SAM
based on the kansei scale of men in their twenties.

The number n of index colors per image is fixed to 2. The value and saturation
of the two index colors are used to define the vector space. Furthermore, four
attributes of a TV personality (age, body proportion, physical attractiveness
and popularity) are also used. As a result, the adaptation takes place in an
8-dimensional vector space.
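
To make the 8-dimensional representation concrete, the sketch below builds one
point from two index colors and four personality attributes, passing the
attributes through the sigmoid of Section 2.2. The function names, the order of
the components and the example inputs are assumptions made for illustration;
in particular, the text does not say which (k, x0) pair belongs to which
attribute, so the pairing used in the example is a guess.

```python
import math

def sigmoid(x, k, x0):
    """f(x) = 1 / (1 + exp(-k (x - x0))); maps x into (0, 1)."""
    z = k * (x - x0)
    if z < -700.0:  # avoid overflow in exp for extreme inputs
        return 0.0
    return 1.0 / (1.0 + math.exp(-z))

def perceptual_vector(index_colors, personality, params):
    """Build the 8-dimensional point for one image.

    index_colors: two (value, saturation) pairs.
    personality: four raw attributes (age, proportion, attractiveness, popularity).
    params: four (k, x0) pairs, one per personality attribute.
    """
    if len(index_colors) != 2 or len(personality) != 4 or len(params) != 4:
        raise ValueError("expected 2 index colors, 4 attributes and 4 (k, x0) pairs")
    vector = [component for color in index_colors for component in color]
    vector += [sigmoid(x, k, x0) for x, (k, x0) in zip(personality, params)]
    return vector

# Example with made-up inputs; the (k, x0) pairs reuse the optimised values
# reported for the bright - dark scale, in an assumed attribute order.
params = [(0.2, 30.0021), (15.0036, 0.7), (100.021, 0.055), (0.06, 50.0021)]
example = perceptual_vector([(0.8, 0.6), (0.3, 0.9)], [28, 0.72, 0.06, 55], params)
```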

Figure 2 shows how the parameter set is optimized by the GA to smooth the
curved surface of bright - dark in the space. The initial population and the
limit on the number of generations are set to 100 and 12000, respectively. The
horizontal axis shows the number of generations in the GA and the vertical axis
represents the smoothness evaluation G for each generation. As a result, the
smoothness reached its minimum, G = 137.8483, with parameters k0 = 0.2,
x0 = 30.0021, k1 = 15.0036, x1 = 0.7, k2 = 100.021, x2 = 0.055, k3 = 0.06,
x3 = 50.0021. Table 2 shows the results of optimization based on the 14
different scales.

Fig. 2. Optimization based on bright-dark scale by GA.



Table 2. Optimized value in adaptation.

Kansei scale                personality       color
Bright - Dark                  137.8483    615.3609
Man - Woman                    142.0269   1126.0935
Warm - Cold                    144.8203    786.4428
Artificial - Natural           108.8662    905.5394
New - Old                      126.2598    430.0494
Loud - Quiet                   148.2697    563.5461
Vivid - Dull                   162.5275    844.4679
Simple - Complicated            99.6125    552.5917
Hard - Soft                    120.1523    659.8007
International - Japanese       106.9665    813.6292
Rustic - Urban                 137.2776    652.1892
Artistic - Scientific           72.3130    470.1387
Beautiful - Ugly               106.7876    777.5336
Open - Closed                  163.6887    889.2479



Second, we specify similar kansei patterns in the perceptual space by
comparing kansei clusters and experimentally evaluate whether the adapted
perceptual space is appropriate for the kansei of men in their twenties. We
also specify the factors related to the similarity.

We compare the kansei clusters in the color space and in the personality space
separately, by calculating the center of each kansei cluster and taking the
Euclidean distance between the centers of two clusters as the dissimilarity of
those clusters. Table 3 lists the 7 most similar pairs of kansei clusters in
the perceptual spaces of color and personality. The result in Table 3 shows
that the similar kansei clusters differ between the color space and the
personality space. To determine which factor, color or personality, mainly
affects kansei patterns, we collected questionnaire responses from 22 men in
their twenties. The questionnaire is composed of 14 questions, each asking
whether 2 images are similar, dissimilar or illegible. The 2 images in each
question are representative images of the similar kansei clusters in the color
or personality space listed in Table 3. The result of the questionnaire is
shown in Table 4. It shows that people mainly judged the similar kansei
clusters in the personality space to be similar. For men in their twenties, the
personality in a TV commercial, rather than color, strongly affects kansei
patterns.

Specification of Kansei Patterns in an Adaptive Perceptual Space 365

Table 3. Similar kansei clusters in each space.

rank   personality          distance   color                      distance
1      woman - open          0.14477   artistic - international    0.06458
2      bright - warm         0.20881   artificial - bright         0.07406
3      natural - quiet       0.28458   artistic - bright           0.09225
4      quiet - simple        0.31252   woman - simple              0.09273
5      bright - open         0.32875   scientific - soft           0.09397
6      natural - simple      0.33772   loud - bright               0.10438
7      beautiful - woman     0.36461   artificial - loud           0.10653
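
The cluster comparison described above (cluster centers, then Euclidean
distance between centers) can be sketched as follows. The function names and
the dictionary layout are our own, and the sketch says nothing about how the
points were adapted beforehand.

```python
import math
from itertools import combinations

def centroid(points):
    """Center of a cluster given as a list of equal-length points."""
    dim = len(points[0])
    return [sum(p[k] for p in points) / len(points) for k in range(dim)]

def most_similar_clusters(clusters, top=7):
    """Rank pairs of kansei clusters by the distance between their centers.

    clusters: dict mapping a kansei label (e.g. "bright") to its list of points.
    Returns up to `top` tuples (distance, label_a, label_b), closest first.
    """
    centers = {name: centroid(points) for name, points in clusters.items()}
    pairs = [(math.dist(centers[a], centers[b]), a, b)
             for a, b in combinations(sorted(centers), 2)]
    return sorted(pairs)[:top]
```

Running this once on the color coordinates and once on the personality
coordinates gives two separate rankings of the kind compared in Table 3.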

Table 4. Evaluation of similar kansei clusters in each space.

personality                                    color
kansei pattern      sim.  dissim.  illeg.     kansei pattern             sim.  dissim.  illeg.
woman - open         17      0       5        artistic - international     6     12       4
bright - warm        12      4       6        artificial - bright          0     19       3
natural - quiet      15      1       6        artistic - bright            3     16       3
quiet - simple        7     10       5        woman - simple               3     16       3
bright - open         5     10       7        scientific - soft            2     20       0
natural - simple      9      5       8        loud - bright                1     20       1
beautiful - woman    18      1       3        artificial - loud            3     18       1

(the number of people)



Figure 3 shows the 7 similar kansei clusters in the perceptual space of
personality. Although most kansei clusters are characterized by age or
proportion, the warm and bright kansei clusters are mainly characterized by
popularity. Four similar kansei patterns (woman - open, bright - warm,
natural - quiet and beautiful - woman) are affected by similarity in
popularity, and woman - open is also affected by similarity in age.

5 Conclusion 

In this paper we proposed the use of a vector space as a scheme for the internal 
representation of the system, where the features of index colors and TV person- 
alities in TV commercial images are introduced as perceptual axes of the space. 
We also represented the TV commercial images in the perceptual space based on 
the kansei scale and analyzed the similarity of kansei clusters using their distance 
in the perceptual space. 






366 T. Murakami, R. Orihara, and N. Sueda 



Fig. 3. Perceptual space of personality.



We constructed an experimental system for still image perception to identify 
the similarity of kansei patterns in the space and to investigate its impact on 
real-world applications. The data adaptation was performed based on the kansei 
scale in the perceptual vector space, so that a kansei cluster is compactly rep- 
resented in the space, rather than spread out all over the space. This is done 
by creating perceptual axes based on the kansei scale. The axes help to create 
a space with fewer peaks where characteristics of the kansei can be identified 
easily. We specified not only similar kansei patterns but also the factors
related to their similarity by comparing the kansei clusters in the perceptual
space. The result shows that personality, rather than color, strongly affects
the similarity in TV commercial images.

However, our study is still at a preliminary stage. We intend to carry out
experimental studies with SAM based on other kansei scales. We should also
extend the range of data we can handle, i.e., extend the method so that it can
handle multimedia data such as movies and sounds. Our final goal is to
construct a decision support system that can analyze multimedia data from
various viewpoints, implemented as parameters of modeling functions.

Abstract. Agent-based models of human operators rarely include explicit
representations of the timing and accuracy of perception and action, although
their accuracy is sometimes implicitly modelled by including random noise for
observations and actions. In many situations though, the timing and accuracy
of the person’s perception and action significantly influence their overall
performance on a task. Recently many cognitive architectures have been extended
to include perceptual/motor capabilities, making them embodied, and they have
since been successfully used to test and compare interface designs. This paper
describes the implementation of a similar perceptual/motor system that uses and
extends the JACK agent language. The resulting embodied architecture has been
used to compare GUIs representing telephones, but has been designed to interact
with any mouse-driven Java interface. The results clearly indicate the impact
of poor design on performance, with the agent taking longer to perform the task
on the more poorly designed telephone. Initial comparisons with human data show
a close match, and more detailed comparisons are underway.



1 Introduction 

Although it is difficult to find a definition of a software agent that all researchers 
will agree upon, one aspect that seems to be universally accepted is that an agent 
is situated — it operates within an environment that it senses in some way, and 
in which its actions are performed. Despite this agreement on the importance of 
being situated, when it comes to using software agents to model human operators 
the details of perception and action are too often ignored. 

In many cases, software agents are simply given perfect vision, able to see
all objects within their field of vision equally clearly, and precise action,
with every action being completely accurate and instantaneous. For some types
of simulation, these simplifications may have little impact on the results, but
in many applications the effects can be significant. In human-computer
interaction, for example, the time taken to find an object on the display and
move the mouse to this object can be significant in the overall timing of the
task, even for experts. In a driving simulation, the accuracy and speed of
steering might make the difference between safe driving and an accident. As
Gray discusses [9], small differences in interface design can have a
significant impact on the time taken to perform common tasks.

Perceptual/motor extensions to cognitive architectures, most notably
ACT-R/PM [4], have allowed researchers to build models that interact with
simulations of the interfaces an operator would use (such as Salvucci’s work on
telephone use while driving [15]), and in some cases with the interface itself
(e.g. the work of Byrne [6,5] and Amant and Riedl [2] on user interface
evaluation). The growing interest in this approach is illustrated in a recent
special edition of the International Journal of Human-Computer Studies [14].
Although a significant amount of work has focused on GUI testing and
evaluation, there are also models which manipulate (simulations of) physical
objects, e.g. [11,15]. These studies all illustrate the importance of including
perception and action in the model in order to get a better match between the
model and the operator being modelled.

This paper describes an implementation of an initial set of functional
perceptual/motor capabilities with the JACK agent language [1]. An agent with
these capabilities was used to compare graphical representations of telephone
interfaces, such as those shown in Fig. 1. Although we limited the motor
capabilities to simple mouse movement and clicking (this is all that was needed
for the interface), the addition of further motor abilities will now be
straightforward. These capabilities will be particularly useful in the JACK
agent language because it is designed for modelling human operators. These
capabilities allow a more complete model of the operator.

Fig. 1. Two sample interfaces with which the agent and human can interact



In the remainder of this paper, we first discuss perception and action from the
perspective of interaction with these GUIs, and then discuss our implementation
of perception and action using JACK. We present the results showing the impact
of simple good and bad GUI designs on agent performance, and some preliminary
work showing that the embodied JACK agent’s performance predicts the human
performance on the same interfaces. From these results, we note how including
perceptual/motor components helps to model human operators, and that similar
effects will influence models of human operators in other types of environments.



2 Interaction with Example Interfaces 

The interfaces in Fig. 1 require only simple interaction: reading the instruction 
at the top of the window, performing the appropriate sequence of mouse clicks 
(which always ends with clicking “OK”), then getting the next instruction, and 
repeating this loop until “Finished” appears in the instruction area. The in- 
terface does not require any keyboard input, nor does it include any complex 
mouse navigation, such as pull-down menus. For a description of how keyboard 
interaction could be included in the model, see the work of Baxter, Ritter and 
their colleagues [3,13] or John [10]. 



2.1 Visual Perception 

The model of visual perception added to the agent corresponds to the three re- 
gions people have in their field of view. The first of these is the fovea, a narrow 
region approximately two degrees in diameter, which is the region of greatest 
visual acuity. The next is the parafovea, which extends approximately five de- 
grees beyond the fovea, and provides somewhat less acuity. For example, if a 
button lay in the parafovea, the operator would probably see the shape, size and 
location of the button, but not recognise the label on it. The remainder of the 
field of view is known as peripheral vision. Perception in this area is extremely 
limited — the operator would probably see that an object was there, but not be 
able to pinpoint its exact location without shifting focus. (This is a necessary but 
gross set of simplifications. There are many more subtleties and regularities.) 

Because of these limitations, an operator will not be able to clearly perceive 
the entire interface simply by looking at a single point on it. The eye will have 
to shift focus in order to perceive the different objects on the display. This 
is achieved through saccadic eye movements, during which the eye effectively 
“jumps” from one focus to another. For saccades less than 30° (which covers 
all saccades with our interface), the “jump” takes about 30 ms, during which
the operator is effectively blind, followed by a longer fixation on the new focus, 
typically of about 230 ms duration [7]. 

These capabilities and limitations allow models in JACK to find information 
on interfaces, but they require effort. The model must know where to look, or it 
must search: it must move its eye, and it must then process what it sees. These 
efforts take time and knowledge, corresponding to similar time and knowledge 
that the operator has. 
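The three visual regions and the timing figures above can be sketched in a few lines of Python. This is only an illustration, not the JACK implementation: the function names and the explicit viewing distance parameter are assumptions introduced here, while the angular sizes and the 30 ms and 230 ms durations are the values quoted in the text.

```python
import math

FOVEA_RADIUS_DEG = 1.0       # fovea is about 2 degrees in diameter
PARAFOVEA_EXTENT_DEG = 5.0   # parafovea extends about 5 degrees beyond the fovea
SACCADE_MS = 30              # duration of a saccade of less than 30 degrees
FIXATION_MS = 230            # typical fixation duration [7]


def eccentricity_deg(eye_xy, target_xy, viewing_distance):
    """Visual angle between the point of focus and a target.

    All lengths share one unit; viewing_distance is an assumed parameter
    (the text does not state the distance from eye to screen).
    """
    offset = math.dist(eye_xy, target_xy)
    return math.degrees(math.atan2(offset, viewing_distance))


def visual_region(ecc_deg):
    """Classify an eccentricity as fovea, parafovea or periphery."""
    if ecc_deg <= FOVEA_RADIUS_DEG:
        return "fovea"
    if ecc_deg <= FOVEA_RADIUS_DEG + PARAFOVEA_EXTENT_DEG:
        return "parafovea"
    return "periphery"


def look_cost_ms(n_saccades):
    """Time spent on n saccades, each followed by one fixation."""
    return n_saccades * (SACCADE_MS + FIXATION_MS)
```

Under these figures, a search that needs two shifts of focus before the label of a button can be read costs about 520 ms before any mouse movement begins.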
2.2 Manual Input

The only manual input required for this interface is mouse movement and click- 
ing. Mouse movements by operators are not accurate, relying heavily on visual 
feedback. Rather than moving in a single linear motion to the object, the human 
operator will move the mouse in a series of shorter segments, with a correction 
at the end of each one, until the mouse pointer appears on the target. Studies 
have shown that each of these segments has length approximately $(1 - \varepsilon)d$, where
$d$ is the remaining distance to the centre of the target, and $\varepsilon$ is a constant 0.07
[7, p. 53]. Each of these movements takes roughly the same amount of time,
approximately 70 ms, plus a delay for visual feedback, before the next movement
begins. This means that although the final position of the mouse will be within
the target, it will not necessarily be at the centre of the target, as shown in
Fig. 2.







Fig. 2. Moving the mouse to a target 



Of course, because of error in the movement, the distance will not be exactly
$(1 - \varepsilon)d$, nor will the point lie directly on the line between the origin and the
centre of the target. Unfortunately, as discussed by MacKenzie et al. [12], while
there have been many studies which report error rates when using a mouse or
other pointing device, very few report the types or magnitudes of errors. We
have extrapolated from the results of MacKenzie et al. to get a mean variability
in final position that is equal to 5% of the distance travelled; further studies
are required to confirm this figure.
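The segment-by-segment movement described in this section can be sketched as follows. This is a minimal sketch rather than the agent's actual code: the function name, the 100 ms visual feedback delay, and the choice of Gaussian noise on each axis with standard deviation equal to 5% of the segment length are all assumptions made here for illustration; only $\varepsilon = 0.07$, the 70 ms segment time and the 5% figure come from the text.

```python
import math
import random

EPSILON = 0.07         # fraction of the remaining distance left after a segment
SEGMENT_MS = 70        # duration of each movement segment
FEEDBACK_MS = 100      # ASSUMPTION: visual feedback delay, not given in the text
ERROR_FRACTION = 0.05  # variability as a fraction of the distance travelled


def move_mouse(start, target, target_radius, error_fraction=ERROR_FRACTION,
               rng=None, max_segments=50):
    """Move towards target in corrected segments; return (position, ms, segments)."""
    rng = rng or random.Random()
    x, y = start
    tx, ty = target
    elapsed = 0
    segments = 0
    while math.hypot(tx - x, ty - y) > target_radius and segments < max_segments:
        # Intended segment covers (1 - EPSILON) of the remaining distance.
        step_x = (tx - x) * (1 - EPSILON)
        step_y = (ty - y) * (1 - EPSILON)
        travelled = math.hypot(step_x, step_y)
        # ASSUMPTION: end-point error is Gaussian on each axis.
        x += step_x + rng.gauss(0, error_fraction * travelled)
        y += step_y + rng.gauss(0, error_fraction * travelled)
        elapsed += SEGMENT_MS + FEEDBACK_MS
        segments += 1
    return (x, y), elapsed, segments
```

With the error switched off, a 200-pixel movement to a target of radius 5 pixels takes two segments (the remaining distance shrinks from 200 to 14 to about 1), and the pointer stops inside the target but short of its centre, which is the behaviour illustrated in Fig. 2.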



3 Implementation 

The implementation of the system consists of two parts: a simple GUI that was 
used for testing purposes, and the embodied JACK agent that interacts with 
this GUI. 
3.1 Interface

The telephone GUI was written in Java 1.3 using Swing components. The user 
can specify the size of the telephone buttons and the spacing between the buttons 
as command line arguments. A transparent pane overlays this GUI, and it is via 
this pane that the agent interacts with the GUI. When the agent “looks” at 
the GUI, the pane returns the details of objects in the fovea and parafovea. 
When the agent moves or clicks the mouse, the pane passes this information to 
the GUI. The eye position of the agent is displayed on the pane, as well as the 
current position of the agent’s mouse pointer. Although the agent was tested 
only using the telephone GUIs, it is designed to interact with any mouse-driven 
GUI written in Java, by overlaying this same pane. 

A control panel is also provided to control the agent and test the interface. 
(See Fig. 3.) This allows the user to adjust the fovea and parafovea size, switch 
between a crosshair display for eye position or a full indication of fovea and 
parafovea boundaries, manually control the eye and mouse positions (for testing 
purposes), disable the controls completely (to interact directly with the tele- 
phone), and create or destroy the agent. The objects that the agent can see 
(both in the fovea and parafovea) are displayed on the control panel, as well as 
the actions that have been performed so far. 
Fig. 3. An agent interacting with the interface, and the associated control panel

3.2 Agent

The agent was written in the JACK agent language, a language implementing 
a BDI (beliefs-desires-intentions) architecture as an extension to the Java pro- 
gramming language. Other than the perception and action capabilities, the agent 
is extremely simple, with just two plans: one that interprets the instructions and 
another that dials a number (retrieving it from memory).

Interaction with the GUI is provided through two capabilities: a vision capa- 
bility which controls eye position and fixations, and an action capability which 
controls mouse position and clicks. 

The vision capability can achieve three goals: to look at a particular object on 
the screen, to look at a particular position on the screen, and to simply observe 
at the current eye position, storing information to memory. The times for eye 
movements and fixations are included using JACK @waitfor(time) statements,
so that the agent takes the appropriate time to achieve these goals. 

Similarly, the action capability can achieve a limited number of goals: to
move the mouse to a particular point or object, to click on a particular object, 
and to click at the current mouse position. Timing to perform these tasks is 
incorporated for these actions as for the eye movements. 

4 Comparison of Model Predictions and Human Data 

For preliminary testing, we created a short sequence of tasks for both the agent 
and human operator to perform using our telephone interface. These tasks were 
displayed in the instruction section at the top of the interface. The instructions 
(in order) were: 

- To start, click OK 

- Dial home, then click OK 

- Dial work, then click OK 

- Redial the last number, then click OK 

- Dial directory enquiries, then click OK 

- Call your mother, then click OK 

After this sequence, “Finished” appeared in the instruction section. The user 
was told in advance which numbers they would be asked to dial, and in one case 
“your girlfriend” was substituted for “your mother” because the user did not 
know that number — the aim was to use numbers that were known “by heart,” 
so that the time to recall them was short and uniform. 

Each of the three users was asked to perform this sequence of tasks 20 times 
— 10 for the interface with the “standard” size buttons, and 10 for the one 
with small, widely-spaced buttons. (The two interfaces in Fig. 1 show the scale 
of differences but are smaller than the real size of 5.5 cm by 9.5 cm.) Every user 
encountered the standard interface first. The users were instructed not to pause 
between starting a sequence of tasks and seeing “Finished” appear, but to take 
as long as they wished between sequences. The time was recorded each time the 
user clicked “OK.” The results for each user were then averaged over each group
of 10 sequences. 

The JACK agent performed the same sequences of tasks, but in this case there 
were 30 repetitions for each version of the interface, and these were averaged. 
Because the number dialled can have a significant impact on the performance of 
this task (e.g. “555” is dialled more quickly than “816”), the agent was compared 
against individual users, dialling the same numbers, rather than aggregating all 
users [8]. Figures 4 and 5 show results from two subjects — the first is the worst 
fit of all subjects, and the second is the best. 




[Plot omitted. Horizontal axis: task number. Series: human and agent, each with small and large buttons.]



Fig. 4. Time taken to perform the sequence of tasks (the worst fit, subject 1) 



In all cases, the time taken to perform the tasks was significantly lower for 
both the human and agent using the GUI with large buttons. The agent has a 
tendency to out-perform the human user on both interfaces, and we suspect that 
this is because the error that we introduce during mouse movement is too small. 
As mentioned previously, further studies are needed to get an accurate figure 
for the magnitude of the error. The raw data (not presented here) also shows 
more variation in the human timing than that of the agent, further reinforcing 
the suspicion that our error magnitude is too small. 

These results are preliminary and come from a very small sample of three
users, but they are extremely promising. We are
now gathering more detailed human data, logging all mouse actions, and using 
an eye tracker to record eye movements, so that more detailed comparisons 
between the users and the model can be made. We will also collect more data 
[Plot omitted. Horizontal axis: task number. Series: human and agent, each with small and large buttons.]



Fig. 5. Time taken to perform the sequence of tasks (the best fit, subject 3) 



on the magnitude of errors in mouse movement. The detailed comparison will 
allow us to further validate the model. 

5 Conclusion 

The work presented here represents a first step in embodying a JACK agent, 
giving it the ability to interact with a GUI by “looking” at the interface, seeing 
it as a human would, and moving and clicking a mouse on the interface. As 
discussed, the early results are promising, and we expect the more detailed com- 
parison with human users to further refine the model. Although we have focused 
on visual perception and mouse input, other modes of perception and action 
could be added in a similar fashion, using the vast wealth of human engineering 
data that has been collected over the years. 

The initial results here clearly indicate the impact of a “bad” user interface 
design, with the agent taking significantly longer to perform the task on the bad 
interface (as did the human users). Our results suggest that an embodied agent 
of this type can be used to test user interfaces, in time eliminating much (though 
probably not all) of the costly user testing stage of GUI design. 

Another application of an agent that is embodied in this way is in a simu- 
lation environment where the agent replaces a human operator, for example, in 
a training simulator. If the agent does not have accurate delays for its actions, 
or perceives the environment in an unrealistic manner, the value of the training 
may be questioned. The trainee may develop unrealistic expectations of their 
team members’ abilities, or they may use tactics that would be unnecessary or 
inappropriate to beat a real world opponent. 
The capabilities we have added to the JACK agent make it more situated,
interacting with its environment in a manner that more closely matches the 
human operator being modelled. This embodiment of the agent gives more real- 
istic performance, making the model suitable for a broad range of applications 
in which the timing and accuracy of perception and action will have a significant 
impact on the performance of the agent. 
Abstract. It is accepted that ontologies are vitally important for
interoperability and information integration. The major part of every
ontology is its taxonomy, a hierarchy of the kind-of relation. A conceptual
relation seldom omitted from ontological considerations of a domain is
the part-of relation. Guarino and Welty provide an ontology of
properties which facilitates dealing with kind-of; we summarise their proposal
from an order-theoretic perspective, and employ it to address part-of.
We propose criteria and analyse the resulting classifications of the
part-of relation. The result is a step towards an ontology of part-of.



1 Introduction 

It is commonly accepted that ontologies are crucial for successful
interoperability and information integration [10,5,6]. There is, however, no consensus
yet on what an ontology is; we believe [5] provides a proper way of defining an
ontology, and improves upon [3]. Whichever definition of ontology one accepts,
building an ontology includes building a taxonomy (of terms). We refer to the
relationship between terms that results in a taxonomy hierarchy as a kind-of relation
and denote it with the symbol $\sqsubseteq$. Guarino’s definition of ontology [5] involves a
domain of objects and a set of conceptual relations on the domain. Selecting an
appropriate set of relations is application dependent. There is, however, a relation
that is seldom omitted from ontological considerations on the domain, the relation of
part-of, sometimes referred to as the part-whole relation. We denote part-of with the
symbol $\preceq$.

The paper is structured as follows. In Section 2 we mainly review the results 
of Guarino and Welty [7], but present them from an order-theoretic perspective. 
In Section 3 we make use of the results reviewed in Section 2, and decide to 
focus on $\preceq$ restricted to types. We propose criteria for classifying $\preceq$ relations,
and derive the resulting classifications. We then compare our results with those 
proposed by others [13,11,8]. We conclude the paper with directions for future 
research. 
2 Kind-of Relation $\sqsubseteq$

This section considers the kind-of relation, denoted $\sqsubseteq$. The discussion is based
on research described in [7]. The research offers an ontology of properties, and 
therefore is crucial for understanding taxonomies and ontologies. 

We provide an order-theoretic view of the ontology. Adding explicit orderings 
(on criteria for classifying properties, and on properties themselves) can enhance 
and facilitate our understanding of the ontology, and becomes a real advantage 
when the proposed ontology of properties needs to be modified or extended. 

The classification of properties offered in [7] is essential for understanding 
taxonomies; it is also helpful in our analysis of the part-of relation in Section 3. 



2.1 Criteria on Properties for $\sqsubseteq$

The criteria considered in [7] are those of identity (I), rigidity (R) and depen- 
dence (D), see Table 1. 



Table 1. Criteria on and classification of properties for $\sqsubseteq$
It seems beneficial to see criteria as ordered sets of meta-properties
(properties of properties), e.g. non-rigidity is a property of (the property) student;
apart from rigidity (denoted $+R$) and non-rigidity ($-R$), there are anti-rigidity
($\sim R$) and strict non-rigidity ($\neg R$), and there is an ordering on these. The three
orderings are depicted in the upper part of Table 1. The definitions [7] follow.

Identity is a binary relation that holds between two objects if they are the
same, or equivalently, that does not hold between two objects if they are different.
A named identity relation is called an identity condition. We say that $\iota$ is an
identity condition for $p$ if for all objects $x, y$ of $p$ we have that
$\iota(x, y) \leftrightarrow x = y$.




380 C. Nowak and R. Raban 



A given property $p$ can have not only the meta-property $+I$ of identity, but
also a meta-property $+O$ of own identity; this happens when a property (class)
introduces its own identity. Therefore, the criterion of identity can be seen as an
ordered set of meta-properties, namely
$(I, <) = (\{\bullet I, +I, -I, +O, +I{-}O, -O\}, <)$,
i.e., $I$ has 6 elements, and for instance $\bullet I < +I < +O$. The relation $<$ is
an information ordering: $+O$ is more specific (carries more information) than
$+I$, which is in turn more specific than $\bullet I$ (we assign the meta-property $\bullet I$ to a
property $p$ if we don’t know whether $p$ has identity).

A rigidity criterion $R$ includes the meta-properties of rigidity (denoted $+R$),
non-rigidity ($-R$), anti-rigidity ($\sim R$), and strict non-rigidity ($\neg R$), where a
property is strictly non-rigid if it is non-rigid but not anti-rigid. Therefore,
$(R, <) = (\{\bullet R, +R, -R, \sim R, \neg R\}, <)$; for the ordering see the upper part of Table 1.

A property is rigid if instances of the corresponding class must necessarily
be instances of the class [7]. Let $p(x)$ denote the fact that an object $x$ has the
property $p$. Then $p$ is rigid (has $+R$) if $\forall x\, (p(x) \to \Box p(x))$, where $\Box$ is a modal
necessity operator, in this context usually seen as a temporal necessity operator
always. Then $p$ is non-rigid (has $-R$) if $\exists x\, (p(x) \wedge \neg\Box p(x))$ (there are objects
that can move out of the class $p$ without ceasing to exist). A property $p$ is
anti-rigid ($\sim R$) if $\forall x\, (p(x) \to \neg\Box p(x))$: for every object in $p$ there is a world (time
moment) in which it is not in $p$ (note that $\neg\Box p(x) = \Diamond\neg p(x)$, where $\Diamond$ denotes
the possibility operator).

Dependence is defined as follows. A property is dependent if every object in
the corresponding class requires an object of another class to exist. We say that
$p_1$ depends on $p_2$ if $\forall x\, (p_1(x) \to \exists y\, p_2(y))$, where $p_1 \neq p_2$ and $x$ and $y$ are not
parts of each other.

The above definitions can be summarised as follows. 

- $\iota$ is an identity condition for $p$ if
$\forall x, y\; p(x) \wedge p(y) \to (\iota(x, y) \leftrightarrow x = y)$;

- $\iota$ is $p$’s own identity condition if it is an identity condition for $p$ not inherited
from $p_2 \sqsupseteq p$;

- $p$ has $+I$ if it has an identity condition;

- $p$ has $+O$ if it has its own identity condition;

- $p$ has $+R$ if $\forall x\, (p(x) \to \Box p(x))$;

- $p$ has $\sim R$ if $\forall x\, (p(x) \to \neg\Box p(x))$;

- $p$ has $\neg R$ if it has $-R$ but does not have $\sim R$;

- $p$ depends on $p_2$ if $\forall x\, (p(x) \to \exists y\, p_2(y))$,
where $p \neq p_2$ and $x$, $y$ not parts;

- $p$ has $+D$ if there is $p_2$ such that $p$ depends on $p_2$;

- $p$ has $-I/O/R/D$ if it does not have $+I/O/R/D$.
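The rigidity definitions summarised above can be checked mechanically on a small finite model. The sketch below is illustrative only and rests on assumptions that are not in the text: a finite set of worlds (time points) that are all mutually accessible, properties encoded as Python predicates over an (object, world) pair, and hypothetical example properties `person`, `student` and `employed`.

```python
def rigidity(prop, objects, worlds):
    """Classify prop as rigid, anti-rigid or strictly non-rigid.

    prop(obj, world) -> bool. An object counts as an instance if it has the
    property in at least one world. With no instances at all, the property
    is reported as rigid (the universal condition holds vacuously).
    """
    instances = [o for o in objects if any(prop(o, w) for w in worlds)]
    always = [o for o in instances if all(prop(o, w) for w in worlds)]
    if len(always) == len(instances):
        return "rigid"               # every instance has p in every world
    if not always:
        return "anti-rigid"          # every instance lacks p in some world
    return "strictly non-rigid"      # non-rigid but not anti-rigid


# Hypothetical model: two people observed at two time points.
OBJECTS = ["alice", "bob"]
WORLDS = [2000, 2001]


def person(obj, world):
    return True


def student(obj, world):
    return obj == "alice" and world == 2000


def employed(obj, world):
    # bob is always employed, alice only in 2001
    return obj == "bob" or world == 2001


print(rigidity(person, OBJECTS, WORLDS))
print(rigidity(student, OBJECTS, WORLDS))
print(rigidity(employed, OBJECTS, WORLDS))
```

In this toy model `person` comes out rigid, `student` anti-rigid, and `employed` strictly non-rigid, since one of its instances can leave the class while the other never does.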

Given our interest in $\sqsubseteq$ we need to know whether identity, rigidity and
dependence get inherited, because this would help us to decide whether a given
property subsumes (is more general than) another property. We return to this 
issue in Section 2.3. 
2.2 Classification of Properties for $\sqsubseteq$

Given the criteria described in Section 2.1, properties can be classified as shown 
in Table 1. 

The resulting taxonomy of properties for $\sqsubseteq$ is presented in Table 2, taken
from [7]. Notice that the taxonomy has the form of a tree. Notice also that Table 1
shows only leaves of the tree of Table 2. In both tables we indicate whether a 
given property carries an identity condition, is rigid, is dependent. 



Table 2. Taxonomy of properties for $\sqsubseteq$

property (p)
    sortal (+I) (s)
        essential (+R) (e)
            type (+O) (t)
            merely essential sortal (-O) (u)
        non-essential (-R) (n)
            type-attribute mixin (-D, -O) (x)
            anti-essential ($\sim R$) (i)
                phase sortal (+O, -D) (h)
                material role (+D) (m)
    formal property (-I) (f)
        category (+R) (c)
        attribute (-D, -R) (a)
        formal role ($\sim R$, +D) (l)



Given that the criteria $I$, $R$ and $D$ determine the ordered sets $(I, <)$, $(R, <)$
and $(D, <)$, it is natural to consider the product $I \times R \times D$ of the three ordered
sets. An ordering on properties derived from the $I \times R \times D$ ordering is presented
in Figure 1; it gives more information about the classification than Table 2 does.
For instance, it not only shows that $t$ and $u$ are essential sortals (below $e$), but
also that $t$ and $h$ join at a node strictly below $s$ (the node that could be called
$s_{+O}$, a sortal that has $+O$), and that $u$ and $x$ join at a node strictly below $s$
(the node that could be called $s_{-O}$, a sortal that has $-O$). Also, formal role
and material role are subsumed by role; the corresponding node can be found
in the bottom part of Figure 1 as the join of $m$ and $l$. The ordering can enhance
our understanding of the classification of properties for $\sqsubseteq$.

An interesting, order-theoretic method for data analysis is offered by Formal
Concept Analysis (FCA) [2,1]. We employ FCA to analyse properties (employed
as FCA-objects) and meta-properties (FCA-attributes); see Figure 2.

Figure 2 shows a resulting concept lattice. For instance, a node marked $h$
represents an FCA-concept “phase sortal.” This FCA-concept consists of an
FCA-extent and an FCA-intent: its extent is $\{h\}$ (collect FCA-objects below
or at the node), its intent is $\{+O, +I, -D, \sim R\}$ (collect FCA-attributes above
Fig. 1. Ordering on properties for $\sqsubseteq$




Fig. 2. FCA ordering on properties for $\sqsubseteq$



or at the node). One can also read from the lattice that, for instance, $t_{+D}$ and
$t_{-D}$ both have $+O$, $+I$ and also $+R$, but only $t_{-D}$ has $-D$.



2.3 Classification of Kind-of Relations 

We need to classify not only properties, but also sub-relations of $\sqsubseteq$. Figure 3
shows the containment between various subsets of $P$ obtained by classifying
properties for $\sqsubseteq$. The containments correspond exactly to the taxonomy on
properties presented in Table 2, but the only subsets of $P$ that now interest
us are those corresponding to the leaves of the taxonomy, namely the subsets
$T$, $U$, $X$, $H$, $M$, $C$, $A$ and $L$; we need answers to questions like the following: if
$p_1$ and $p_2$ are elements of those subsets, is it the case that $p_1$ subsumes, or is
subsumed by, $p_2$?
Fig. 3. Containment & taxonomy hierarchy



As mentioned in Section 2.1, we need to know whether identity, rigidity
and dependence get inherited. It is easy to see that identity and dependence
are inherited. Rigidity is not inherited. However, anti-rigidity is inherited; to see
this, let $p$ be anti-rigid, $p \sqsupseteq q$, and $x$ be an element of the domain. We have
that $q(x)$ implies $p(x)$ (because $p \sqsupseteq q$), implies $\Diamond\neg p(x)$ (by anti-rigidity of $p$),
implies $\Diamond\neg q(x)$ (because $p \sqsupseteq q$); hence anti-rigidity of $q$. Therefore, we have
the following:

- sortals never subsume formal properties; 

- anti-rigid properties never subsume rigid properties; 

- dependent properties never subsume independent ones.
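The inheritance argument for anti-rigidity given before the list can be written out step by step. The derivation below is a sketch under one assumption that the text leaves implicit: that the subsumption $p \sqsupseteq q$ holds in every world, so that $\neg p(x)$ in a world implies $\neg q(x)$ in that same world.

```latex
% Requires amsmath (align*, \text) and amssymb (\Diamond, \sqsupseteq).
% Assumption: p \sqsupseteq q holds in every world, not only the current one.
\begin{align*}
q(x) &\;\Rightarrow\; p(x)
     && \text{since } p \sqsupseteq q \\
     &\;\Rightarrow\; \Diamond\neg p(x)
     && \text{since } p \text{ is anti-rigid} \\
     &\;\Rightarrow\; \Diamond\neg q(x)
     && \text{since } \neg p(x) \text{ implies } \neg q(x) \text{ in that world}
\end{align*}
```

The same chain fails for rigidity: from $q(x)$ one obtains $\Box p(x)$, but nothing forces $\Box q(x)$, which is why rigidity is not inherited.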

The above restricts the subsumption relation between elements of $P$. In particular,
categories are never subsumed by types. The practice of ontological modeling
tells us that categories always subsume types (simply put entity at the top of
the taxonomy). Hence, we can consider $\sqsubseteq_{tc} = \sqsubseteq|_{T \times C}$; similarly, one can introduce
$\sqsubseteq_{ut}$. Such results add structure to taxonomies, i.e., they allow us to partition $\sqsubseteq$ by
restricting it to properties for which it holds.

A taxonomy hierarchy is an ordered set of properties $P$, with $\sqsubseteq$ being the order
relation, i.e., it is $(P, \sqsubseteq)$. When $p_1 \sqsubseteq p_2$, this is sometimes represented graphically
by drawing an upward-pointing arrow line from the node representing $p_1$ to
the node representing $p_2$, i.e., $p_1 \rightarrow p_2$. Sections 2.2 and 2.3 discussed classifying
properties (for $\sqsubseteq$), and classifying sub-relations of $\sqsubseteq$. We have identified, for
instance, two more specialised kind-of relations $\sqsubseteq_{tc}$ and $\sqsubseteq_{ut}$, with kind-of links
from types to categories, and from merely essential sortals to types. Selecting
such specific subsets of $\sqsubseteq$ adds structure to the process of taxonomy building.

Figure 3, the right hand part, shows a fragment of the taxonomy building
process. As suggested in [7], one should first construct inheritance ($\sqsubseteq$, kind-of)
links from types to categories, and from merely essential sortals to types. After
this is done, phase sortals should be added, resulting in a backbone ontology.
3 Part-of Relation $\preceq$



In this section we make use of the results reviewed in Section 2. In particular,
given the ontology of properties proposed in [7], we focus on properties called
types, i.e., properties that are rigid and carry identity conditions. We see $\preceq$ as a
relation on properties, i.e., $\preceq \subseteq P \times P$, where $P$ is a set of properties. Therefore,
given that $T \subseteq P$, where $T$ denotes a set of types, we restrict our attention to
$\preceq|_{T \times T}$.

We propose criteria for classifying $\preceq$ relations, and derive the resulting
classifications. Two main criteria we consider are those of exclusiveness and
essentiality.

When considering parts and wholes it seems that the notion of dependence is 
highly relevant. We base our criterion of essentiality on identity and functionality 
and it is therefore the essentiality criterion that strongly connects to dependence. 
One can certainly ask “dependence-oriented” questions. Does the part “identity- 
depend” on the whole, i.e., does it cease to exist when the whole does? Does the 
part “functionality-depend” on the whole, i.e., does it stop functioning when the whole does? Does the part
“location-depend” on the whole, i.e., can we locate the part (separability issue) 
given that we know the location of the whole? 

Given two sets, the set of parts and the set of wholes, the term “exclu- 
siveness” relates to certain properties of mappings between the sets of parts 
and wholes. For instance, there can be a 1-1 onto mapping between parts and 
wholes, or the mapping from wholes to parts is 1-1 — in the latter case we have 
that “wholes do not share parts,” or that given a part and its whole, the part 
belongs “exclusively” to that whole. Although dependence can mainly be cap- 
tured by essentiality and separability, one can make use of exclusiveness, as well, 
because it tells us whether parts (wholes) can be shared. If a part is shared by 
two wholes, does one whole depend on the other one? If two parts share a whole, 
does one part depend on the other one? 

Our criteria of exclusiveness and essentiality not only provide a way of
classifying $\preceq$, but also order various kinds of $\preceq$-relation. When one needs to
consider a specific part-of relation, one can find its place in the exclusiveness and
essentiality orderings. The advantage of being able to do the classification is 
that different inferences can be performed in different part-of relations. This is 
similar to the driving force behind the problem of transitivity of part-of [13, 
8] — computing transitive closure of part-of can give us useful inferences, but it 
only makes sense if the relation is transitive, for otherwise invalid inferences are 
derived. But transitive closure on part-of (finding parts of the given whole, and 
parts of the parts, recursively) is not the only way to obtain useful inferences. 
One might also be interested in updating the database of assets when some of 
them cease to exist (identity), stop functioning (together with identity it gives
our essentiality), or their parts get separated (where separation either affects 
identity / functioning or not). 

We then compare our results with those proposed by others [13,11,8]. We 
also indicate directions for future research. 
3.1 Exclusiveness

In this section we discuss a criterion called “exclusiveness” and obtain a resulting 
ordering, as presented in Figure 4.




Fig. 4. Exclusiveness ordering for $\preceq$

Suppose that given two objects $x_1$ and $x_2$, we can decide¹ whether $x_1$ is a
part of $x_2$; denote this by $x_1 < x_2$. Let $\preceq$ denote a given part-of relation we are
interested in. We understand $\preceq$ as a relation on properties, i.e., $\preceq \subseteq P \times P$, or
more specifically $\preceq \subseteq T \times T$. Then, when we say that $p_1 \preceq p_2$, what we mean
is that an object $x_1$ that has $p_1$ (that is, $p_1(x_1)$) is a part of $x_2$ that has $p_2$.
Let $X_1$ collect all parts (objects having $p_1$ which are parts of objects having
$p_2$) and let $X_2$ collect all wholes. The criterion of exclusiveness is obtained by
considering what mapping between $X_1$ and $X_2$ the relation $\preceq$ offers, i.e., is it
1-1, onto, total? This leads to the following definitions.

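(a) $p_1 \preceq_a p_2$ iff $\exists x_1 \exists x_2\; x_1 < x_2$

(b) $p_1 \preceq_b p_2$ iff $\forall x_1 \exists x_2\; x_1 < x_2$

(c) $p_1 \preceq_c p_2$ iff $\forall x_2 \exists x_1\; x_1 < x_2$

(d) $p_1 \preceq_d p_2$ iff $p_1 \preceq_b p_2 \wedge p_1 \preceq_c p_2$

(e) $p_1 \preceq_e p_2$ iff $\forall x_1 \exists! x_2\; x_1 < x_2$

(f) $p_1 \preceq_f p_2$ iff $\forall x_2 \exists! x_1\; x_1 < x_2$

(g) $p_1 \preceq_g p_2$ iff $p_1 \preceq_e p_2 \wedge p_1 \preceq_d p_2$

(h) $p_1 \preceq_h p_2$ iff $p_1 \preceq_f p_2 \wedge p_1 \preceq_d p_2$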

¹ We can decide whether $x_1$ is a part of $x_2$, for instance, on the basis of whether they
are identity/functionality/separability dependent on each other; see Section 3.2.
Figure 4 presents the subsumption relation between the definitions, e.g., $\preceq_a$ is more
general than $\preceq_b$ because if $p_1 \preceq_b p_2$ then we necessarily also have that $p_1 \preceq_a p_2$.
A simple version of the ordering could be obtained by considering a product of
the linearly ordered sets $\{a, b, e\}$ and $\{a, c, f\}$, obtaining $\{a, b, e, c, f, d, g, h, i\}$
(black-filled nodes). When one also introduces $\preceq_j$ and $\preceq_k$, then the result is
$\{a, b, e, j\} \times \{a, c, f, k\}$, shown as a small diagram in Figure 4, and its bigger
variation; this classification is a preliminary one, as it is still incomplete.

We provide some comments and examples on the definitions: 

(a) some $p_1$s are parts of some $p_2$s, e.g.: diesel-engine $\preceq_a$ car;

(b) all $p_1$s are parts of some $p_2$s;

(c) all $p_2$s are wholes for some $p_1$s;

(e) all $p_1$s are not-shared parts of some $p_2$s, e.g.: carburettor $\preceq_e$ engine;

(f) all $p_2$s are not-shared wholes for some $p_1$s, e.g.: engine $\preceq_f$ car;

(i) all $p_1$s are not-shared parts of some $p_2$s, all $p_2$s are not-shared wholes for
some $p_1$s, mapping between $p_1$s and $p_2$s is 1-1 onto, e.g.: mind $\preceq_i$ person;

(j) all $p_1$s are parts of some $p_2$s, some parts $p_1$s shared by wholes $p_2$s,
e.g.: computer-printer $\preceq_j$ computer-network;

(k) all $p_2$s are wholes for some $p_1$s, some wholes $p_2$s shared by parts $p_1$s,
e.g.: engine $\preceq_k$ speed-boat;

(n) all $p_1$s are parts of some $p_2$s, all $p_2$s are wholes for some $p_1$s,
some parts $p_1$s shared by wholes $p_2$s, some wholes $p_2$s shared by parts $p_1$s,
e.g.: processor $\preceq_n$ computer;

(o) all $p_1$s are parts of some $p_2$s, some parts $p_1$s shared by wholes $p_2$s,
all $p_2$s are not-shared wholes for some $p_1$s, e.g.: heart $\preceq_o$ person;

(p) all $p_2$s are wholes for some $p_1$s, some wholes $p_2$s shared by parts $p_1$s,
all $p_1$s are not-shared parts of some $p_2$s, e.g.: spark-plug $\preceq_p$ petrol-engine.
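The black-filled part of the exclusiveness ordering, cases (a) to (i), can be sketched as a check on a finite object-level part-of relation. The sketch makes several assumptions that go beyond the text: the function name and the encoding of the relation as a set of (part, whole) pairs are hypothetical, “not-shared” is read as “exactly one”, and case (i) is taken to be the conjunction of (e) and (f), as its description suggests.

```python
def exclusiveness(parts, wholes, part_of):
    """Return the labels among a..i that hold for a finite part-of relation.

    parts:   objects having p1
    wholes:  objects having p2
    part_of: set of (x1, x2) pairs meaning x1 is a part of x2
    """
    wholes_of = {x1: {x2 for x2 in wholes if (x1, x2) in part_of} for x1 in parts}
    parts_of = {x2: {x1 for x1 in parts if (x1, x2) in part_of} for x2 in wholes}

    holds = {
        "a": any(wholes_of.values()),                          # some part of some whole
        "b": all(len(ws) >= 1 for ws in wholes_of.values()),   # every part has a whole
        "c": all(len(ps) >= 1 for ps in parts_of.values()),    # every whole has a part
        "e": all(len(ws) == 1 for ws in wholes_of.values()),   # parts are not shared
        "f": all(len(ps) == 1 for ps in parts_of.values()),    # wholes are not shared
    }
    holds["d"] = holds["b"] and holds["c"]
    holds["g"] = holds["e"] and holds["d"]
    holds["h"] = holds["f"] and holds["d"]
    holds["i"] = holds["e"] and holds["f"]
    return {label for label, value in holds.items() if value}


# Hypothetical data: each mind belongs to exactly one person and vice versa.
minds = exclusiveness({"m1", "m2"}, {"ann", "bob"},
                      {("m1", "ann"), ("m2", "bob")})

# Hypothetical data: one printer shared by two networks.
printer = exclusiveness({"pr1"}, {"net1", "net2"},
                        {("pr1", "net1"), ("pr1", "net2")})

print(sorted(minds))
print(sorted(printer))
```

For the mind and person data every label from a to i holds, matching the 1-1 onto case. For the shared printer, (e) fails because the part has two wholes, so (g) and (i) fail as well.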



3.2 Essentiality and Separability 

In this section we discuss a criterion called “essentiality” (see Figure 5) and 
derive a resulting ordering (see Figure 6). 

Consider the top part of Figure 5. We intend to capture how identity/functionality
changes of the part (whole) affect the whole (part). Let $+f$ denote the
fact that the part functions properly, $-f$ that it does not (i.e., it has lost some
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Fig. 5. Essentiality criterion 



functionality), and —i that it has lost its identity (and therefore, it ceased to ex- 
ist); similarly, we employ +F, —F, —I for the whole. To capture the identity and 
functionality dependence, consider the functions d \ {— /, — i} — > {+T', —F, — /} 
and D: {—F,—I} — > {+/,—/, —t}- We use d to describe how the whole depends 
on its part, and D to describe how the part depends on its whole. For instance, 
if d{—f) = —F then we say that when the part stops functioning properly, then 
so does the whole. Given that we can order {+f, —f, — 1 } and {-PF’, —F, — /} to 
indicate how drastic the change to the part and whole is, +f < —f < —i and 
+F < —F < —I, it is appropriate to require that both mappings are order- 
preserving, so we restrict d and D to functions that satisfy d{—f) < d{—i) and 
D{—F) < D{—I), respectively. For instance, one would not like to allow both 
d{—f) = —I and d{—i) = —F, because if the part ceases to exist {d{—i)), then 
it also ceases to function (d(— /)), and therefore we should have d{—i) = —I, 
(because d{—f) = —I) rather than only d{—i) = —F. Therefore, we get the 
following functions d\, ... ,de. 

d1 = {(-i, -I), (-f, -I)},
d2 = {(-i, -I), (-f, -F)},
d3 = {(-i, -I), (-f, +F)},
d4 = {(-i, -F), (-f, -F)},
d5 = {(-i, -F), (-f, +F)},
d6 = {(-i, +F), (-f, +F)},

and we can order them: di > dj iff di(-i) > dj(-i), or di(-i) = dj(-i) ∧ di(-f) >
dj(-f). In an analogous way we obtain D1, ..., D6 and an ordering on them.
Figure 5 illustrates the functions, and provides examples.
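The six functions can be recovered by brute force: enumerate every map from {-f, -i} to {+F, -F, -I} and keep those that preserve the severity order. The sketch below does this and sorts the survivors by the ordering just described (first by the image of -i, then by the image of -f). The names `WHOLE`, `RANK` and `monotone_maps` are ours, introduced only for illustration.

```python
from itertools import product

# States of the whole, listed by increasing severity: +F < -F < -I.
WHOLE = ["+F", "-F", "-I"]
RANK = {state: k for k, state in enumerate(WHOLE)}


def monotone_maps():
    """Return all order-preserving maps d: {-f, -i} -> {+F, -F, -I},
    sorted from the strongest dependence to the weakest."""
    maps = []
    for on_f, on_i in product(WHOLE, repeat=2):
        # order preservation: -f < -i must give d(-f) <= d(-i)
        if RANK[on_f] <= RANK[on_i]:
            maps.append({"-f": on_f, "-i": on_i})
    # compare d(-i) first, then d(-f); largest first
    maps.sort(key=lambda d: (RANK[d["-i"]], RANK[d["-f"]]), reverse=True)
    return maps


D_FUNCTIONS = monotone_maps()
```

Of the nine possible maps exactly six survive the monotonicity test, and the sorted list starts with the map sending both -f and -i to -I and ends with the map sending both to +F.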

Given the ordered sets of functions {di}i and {Di}i the essentiality ordering 
is given by {di}i x {Di}i, a product of the ordered sets of functions. This is shown 
in Figure 6, with examples of Figure 5. In the figure three nodes are marked mp,
ec and hs, and represent the examples of Figure 5, namely, mind ⪯ person,
engine ⪯ car and hull ⪯ ship, respectively.

In this way we can capture whether parts and wholes depend on each other — 
w.r.t. identity and functionality. The final result is that we can, given a particular 
part-of relation, see how strong the dependence is. It is important to make use
of this information: depending on where our specific ⪯ is, different computations
might be performed for

Fig. 6. Essentiality ordering for ⪯




Fig. 7. Separability ordering for ⪯



Figure 7 presents a simple way of ordering ⪯-relations w.r.t. separability:
when a part and its whole are separated, do they still function properly, or do
they lose their functionality or identity? The same notation as in the discussion
of essentiality is employed. For instance, marking a node labelled (+f, +F) would
mean that the part and the whole are fully separable, as they can continue to
function properly after separation; a node labelled (-i, -I) corresponds to the
case of non-separability, as both the part and the whole would cease to exist
when separated.

3.3 Classification of Part-of Relations 

The criterion of exclusiveness results in an ordered set {⪯a, ..., ⪯p} of part-of
relations, while essentiality gives us {⪯ij | i,j ∈ {1,...,6}}, cf. Figures 4 and 6.
Comparing our results with e.g.: [11,9,13,8] it should be noted that [11] and [9] 
include topological concepts that would be crucial for more detailed treatment 
of separability. Both [13] and [8] focus on transitivity of the part-whole relation, 
which is only one of the possible ways of employing ⪯ in automated inferencing.
The issue of transitivity, from our exclusiveness and essentiality perspective,
converts to the issue of composition of ⪯-relations, i.e., given ⪯1, ⪯2 as elements
of the set {⪯a, ..., ⪯p} ∪ {⪯ij | i,j = 1,...,6}, what is ⪯1 ∘ ⪯2, or which element
of the set is it? We certainly have that, e.g., ⪯i ∘ ⪯i is a ⪯i (see Figure 4) and
⪯22 ∘ ⪯22 is a ⪯22 (see Figure 6), but we do not provide the whole composition
table here.
4 Conclusion

In this paper we have discussed some aspects of ontology of properties [7] and 
its use in constructing a taxonomy, providing some simple order-theoretic view 
of the criteria. We have then employed the ontology to restrict our treatment of 
part-of to types. We then proposed some criteria for classifying part-of and 
derived the classification. 

There is a large number of relevant issues that have been given no treat- 
ment in this paper. For instance, an axiomatic perspective on mereologies and 
mereotopologies presented in [11] provides an ordering on them. The approach 
would be crucial for a proper treatment of separability, where topological notions 
such as overlap, connection and boundary are needed. Other relevant references 
include [9] and [4]; for a recent paper on separability, addressing identity (does 
the tail of a cat exist?) see [12]. Summarising, vindicating ordered structures 
and exploring kind-of and part-of are important for ontological modeling and 
information integration. 
Abstract. The Belief, Desire, Intention (BDI) architecture is increasingly
being used in a wide range of complex applications for agents. Many theories and
models exist which support this architecture, and the most recent version adds
Capability as an additional construct. In all these models the concept
of action is seen in an endogenous manner. We argue that the Result of an action
performed by an agent is extremely important when dealing with composite
actions, and hence there is a need for an explicit representation of them. The Capability
factor is supported using a RES construct and it is shown how the components of
a composite action are supported using these two. Further, we introduce an OPP
(opportunity) operator which in alliance with Result and Capability provides
better semantics for practical reasoning in BDI.



1 Introduction 

A paradigm shift is happening in both Artificial Intelligence and mainstream computer 
science with the advent of agents and agent-oriented approaches to developing systems, 
both on a theoretical and practical level. One such approach called BDI takes mental 
attitudes like Belief, Desire and Intention as the primitives and has given rise to a set 
of systems called Intentional Agent Systems [2,5,7,9]. Of these the one by Rao and
Georgeff [13] has been widely investigated due to its strong links with theoretical work. 
Many modifications have been made since the initial work, the most recent being the 
addition of a Capability [11] construct along with the three primitive modalities. In 
all these systems the concept of action is seen in an endogenous manner. Though it is 
possible to come up with accounts of action without representing them explicitly, many 
problems that plague endogenous formalisations can be avoided in exogenous ones. The 
later work by Rao [12] makes this shift but then it is restricted to the planning domain. 

This paper can be viewed as a further extension of the existing BDI theory whereby
we reason about the mental state of an agent during the execution of an action in an
exogenous way. We investigate the close connection between the result of an action
performed by a BDI agent and its capability of achieving that result. We argue that
though the agent might have a capability to perform an action it need not be the case
that the opportunity should always accompany it. This view gains importance when we
take into consideration composite actions where one action follows the other (φ1; φ2),
which means an agent performs φ1 followed by φ2. In such cases the result of the
component parts of the action is needed for the overall success of the action. It also seems
reasonable to declare that the agent has the relevant opportunity to perform the
component actions in such a way that the execution leads to an appropriate state of affairs.
By making actions explicit in BDI we try to avoid some of the problems that plague
the endogenous systems when dealing with composite actions. We describe a formal
relationship between the Result, Opportunity, Belief, Desire and Intention modalities. It
is important to note the close connection between Intention and Result. For instance, if
an agent intends to perform a plan, we can infer that under certain conditions he intends the
result of the plan. The same holds for Goals and Results.

This work is partially motivated by the KARO architecture of Van Linder [10],
whereby we indicate how Result and Opportunity can be integrated into the existing BDI
framework. Such an addition paves the way for a better understanding of the
dynamics of BDI systems. The rest of the paper is organised as follows. In section 2 we
make a distinction between intentional action (actions with a pre-defined intention) and
intending an action (actions with future intention) and claim that composite actions
fit well under actions with future intention. Section 3 gives a brief summary of the
original BDI logic as developed by Rao [13] and the recent version of it with the
Capability construct [11]. Sections 4 and 5 integrate two new operators RES and OPP with
the existing BDI architecture. Section 6 gives the full picture of the new semantics. We
have purposefully avoided the use of any temporal operators as this remains part of
future work. In section 7 we formalise the commitment axioms according to the new
semantics, and the conclusion and future work are presented in section 8.

2 Intentional Action & Intending an Action 

When one takes into account the compositional nature of actions φ1; φ2 (φ1 followed
by φ2) it seems contradictory to believe that endogenous logics alone can account for
the mental state of an agent during the execution of such actions. The problem with
the current formalisms is in their failure to differentiate Intentional Action (Predefined
Intention) from Intending to do an Action (Future Intention). Most of the work in BDI
represents actions in the former manner. In the work of Rao [13] formulas like BEL(φ),
GOAL(φ) etc. are used to denote the belief and goal of an agent performing an action φ.
The formalism remains true for single actions, but when it comes to composite actions
like (φ1; φ2) it fails to do justice as it is taken for granted that the execution of the first
action necessarily leads to the second without mentioning anything about the result of
the first action on the second. Based on the existing BDI architecture the concept of
composite actions could be formalised as

INT(does(φ1; φ2)) ⇒ does(φ1; φ2).

This need not be the case as the performance of φ1 could result in a counterfactual state
of affairs. It seems crucial to consider the result of the first action for the overall success
of the composite action. In the same manner the formulas like

GOAL(φ1; φ2) ⇒ CAP(GOAL(φ1; φ2))




392 V. Padmanabhan, G. Governatori, and A. Sattar



seem to be problematic as the formulation doesn’t tell anything about the ability of the 
agent if the first action results in a counterfactual state of affairs. It doesn’t mention 
anything regarding the Opportunity the agent has in performing the second action. 

It is important to make a division between the two action constructs of Intentional 
and Intending for our framework. The former relates to a predefined intention, where the 
Result of an action is taken for granted, whereas the latter concerns a future intention, 
where further deliberation is done as to what the result would be before an action is
performed. Davidson [6] overlooks such a division and extends the concept of intentionally
doing to that of intending to. Though Bratman [1] points out this disparity, the current
formalisms do not allow for a sound representation using the existing modal operators.
Hence the need for additional constructs like RES and OPP. In intentional action, there
is no temporal interval between what Davidson terms as all-out evaluation and action. 
So there is no room for further practical reasoning in which that all-out evaluation can 
play a significant role as input. The BDI framework gives primary importance to prac- 
tical reasoning and hence to means-end reasoning which is important to avoid further 
deliberation at the time of action. Therefore it seems appropriate to categorise compos- 
ite actions under future intentions as they play a crucial role in our practical thinking. 
More importantly, we form future intentions as part of larger plans whose role is to 
aid co-ordination of our activities over time. As elements in these plans, future inten- 
tions force the formation of yet further intentions and constrain the formation of other 
intentions and plans. 



3 The BDI Logic 

The logic developed by Rao and Georgeff [13] is based on Computational Tree Logic 
(CTL*) [4] extended with a first order variant for the basic logic and a possible-worlds 
framework for the Belief, Goal and Intention operators. The world is modelled using a 
temporal structure with a branching time future and a single past called a time-tree. A 
situation refers to a particular time point in a particular world. Situations are mapped 
to one another by occurrence of events. The branches in a time tree can be viewed as 
representing the choices available to the agent at each moment in time. There are two 
kinds of formulae in the logic called the state formulae and path formulae. The former 
are evaluated at a specified time point in a time tree and the latter over a specified path 
in a time tree. Two modal operators optional and inevitable are used for path formulas.
optional is said to be true of a path formula φ at a particular point in a time-tree if φ is
true of at least one path emanating from that point. inevitable is said to be true of a path
formula φ at a particular point in a time-tree if φ is true of all paths emanating from
that point. The standard temporal operators ◇ (eventually), □ (always), ○ (next) and
U (until) operate over state and path formulae. These modalities can be combined to
describe the options of an agent.
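The difference between optional and inevitable can be illustrated on a small finite time-tree. The sketch below is a simplification made for illustration only: the logic itself works with branching-time structures in which paths need not be finite, whereas here a path is a finite list of time points ending in a leaf. The names `paths_from`, `eventually`, `TREE` and `LABELS` are hypothetical.

```python
def paths_from(tree, node):
    """Yield every path from node to a leaf of the (finite) time-tree."""
    children = tree.get(node, [])
    if not children:
        yield [node]
        return
    for child in children:
        for rest in paths_from(tree, child):
            yield [node] + rest


def eventually(atom, labels):
    """Path formula: atom holds at some time point on the path."""
    return lambda path: any(atom in labels.get(n, set()) for n in path)


def optional(tree, node, path_formula):
    """True if the path formula holds on at least one path from node."""
    return any(path_formula(p) for p in paths_from(tree, node))


def inevitable(tree, node, path_formula):
    """True if the path formula holds on all paths from node."""
    return all(path_formula(p) for p in paths_from(tree, node))


# t0 branches into t1 (job printed) and t2 -> t3 (printer jammed).
TREE = {"t0": ["t1", "t2"], "t2": ["t3"]}
LABELS = {"t1": {"printed"}, "t3": {"jammed"}}
```

At t0 the formula "eventually printed" is optional but not inevitable, since the branch through t2 never reaches a point labelled printed.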

Beliefs, Goals and Intentions are modelled as a set of belief-, goal- and intention-accessible
worlds associated with an agent in each situation. An agent x has a belief φ
at a time point t (BEL(φ)) if φ is true in all belief-accessible worlds. It is the same
case for goals (GOAL(φ)) and intentions (INT(φ)). The logic is based on the concept
of strong realism which requires the goals to be compatible with beliefs, and intentions
with goals. This is done by requiring that for every belief-accessible world w at time-point
t, there is a desire-accessible world w' at that time point which is a sub-world of
w. The converse does not hold as there can be desire-accessible worlds that do not have
corresponding belief-accessible worlds. There are similar relationships between
goal-accessible and intention-accessible worlds. The axiomatization of beliefs is the standard
weak-S5 (or KD45) modal system [8]. The D and K axioms are adopted for goals and
intentions, which means that goals and intentions have to be closed under implication
and have to be consistent. We are concerned with the semantics of the mental attitudes,
and the details concerning the possible worlds semantics for various state and path
formulae are given in Appendix A. The set of belief-accessible worlds of an agent x
from world w at time t is denoted by B_t^w(x). Similarly we use G_t^w(x) and I_t^w(x) to
denote the sets of goal- and intention-accessible worlds of agent x in world w at time t,
respectively. When we state the rules and axioms the world w is taken for granted and
the formalism is based on the agent, action and time. The semantics for beliefs, goals
and intentions can be defined formally as follows.

Definition 1 For an interpretation M, with a variable assignment v, a possible world
w and a temporal variable t, the semantics for the mental attitudes can be given as:

- M, v, w_t ⊨ BEL(φ) iff ∀w' ∈ B_t^w(x), M, v, w'_t ⊨ φ;

- M, v, w_t ⊨ GOAL(φ) iff ∀w' ∈ G_t^w(x), M, v, w'_t ⊨ φ;

- M, v, w_t ⊨ INT(φ) iff ∀w' ∈ I_t^w(x), M, v, w'_t ⊨ φ.

The rules and axioms depicting the semantic conditions are given as in [13]. The temporal
variable t stands for a constant. We do not make any explicit representation of time as
it remains part of future work.
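Definition 1 gives all three attitudes the same shape: a formula holds under the attitude exactly when it holds in every world accessible under the corresponding relation. A minimal sketch of this, restricted to atomic formulas and a single fixed agent, world and time point, is given below. The world names, the valuation and the helper `holds_everywhere` are assumptions made for illustration.

```python
def holds_everywhere(accessible_worlds, valuation, atom):
    """An attitude holds for atom iff atom is true in every accessible world."""
    return all(atom in valuation[w] for w in accessible_worlds)


# Hypothetical model: which atoms are true in which world.
VALUATION = {
    "w1": {"door_open", "light_on"},
    "w2": {"door_open"},
    "w3": {"light_on"},
}

# Hypothetical accessibility sets for one agent at one situation.
BELIEF_WORLDS = {"w1", "w2"}
GOAL_WORLDS = {"w1"}
INTENTION_WORLDS = {"w1"}


def bel(atom):
    return holds_everywhere(BELIEF_WORLDS, VALUATION, atom)


def goal(atom):
    return holds_everywhere(GOAL_WORLDS, VALUATION, atom)


def intend(atom):
    return holds_everywhere(INTENTION_WORLDS, VALUATION, atom)
```

Here the agent believes `door_open` because it is true in both belief-accessible worlds, but does not believe `light_on`, which fails in w2.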

Definition 2 Let φ be a formula, BEL, INT and GOAL be the modal operators for
the mental constructs, done, does be the operators for event types, and inevitable be
the modal operator for path formulae; then we have the following axioms:

A1 GOAL(φ) ⇒ BEL(φ)

A2 INT(φ) ⇒ GOAL(φ)

A3 INT(does(e)) ⇒ does(e)

A4 INT(φ) ⇒ BEL(INT(φ))

A5 GOAL(φ) ⇒ BEL(GOAL(φ))

A6 INT(φ) ⇒ GOAL(INT(φ))

A7 done(e) ⇒ BEL(done(e))

A8 INT(φ) ⇒ inevitable◇(¬INT(φ))

Axiom A3 seems to be problematic because of the fact that the event e need not be 
necessarily restricted to a single action. If the agent has a choice of actions at the current 
time point, he/she would be incapable of acting intentionally until she deliberates and 
chooses one of them. It is the same case when the particular event is a composite action. 
The agent needs to deliberate on the result of the first action for the successful execution 
of the second one. It might also be the case that the agent lacks the relevant opportunity 
at that particular time point of doing the specific action. It becomes more relevant with 
the addition of the capability construct as given below. 
The basic axioms with the capability construct are the same as those given in [11].
The temporal variable has been added in the semantics. 

C1 CAP(φ) ⇒ BEL(φ)

C2 GOAL(φ) ⇒ CAP(φ)

C3 CAP(φ) ⇒ BEL(CAP(φ))

C4 GOAL(φ) ⇒ CAP(GOAL(φ))

C5 INT(φ) ⇒ CAP(INT(φ))

The semantic conditions for C2 and C3 can be given as follows:

Definition 3 Let C_t^w(x) be the set of capability-accessible worlds of agent x in world
w at time t.

- ∀w' ∈ C_t^w(x), ∃w'' ∈ G_t^w(x) such that w'' ⊑ w';

- ∀w' ∈ B_t^w(x), ∀w'' ∈ C_t^w'(x) we have w'' ∈ C_t^w(x).

The first constraint means that for every capability-accessible world w' at time-point t,
there is a goal-accessible world w'' at that time-point which is a sub-world of w'. The
converse doesn't hold as there can be goal-accessible worlds that do not have
corresponding capability-accessible worlds. The second constraint is more complicated
and deviates from the original interpretation as in [13]. It means that for every
belief-accessible world w' at time-point t, every world w'' that is capability-accessible
from w' at time-point t is also a member of the capability-accessible worlds of w at
time-point t, i.e., if the agent has a capability to
achieve φ then the agent believes that she has such a capability.
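On a finite model the two constraints of Definition 3 are simple quantifier checks. The sketch below is an illustration under strong simplifying assumptions: for the first constraint a world is represented as a frozenset of the facts it contains, and "sub-world" is approximated by the subset relation (in the logic a sub-world is a sub-tree of a time-tree). The function names `c2_holds` and `c3_holds` are ours.

```python
def c2_holds(cap_worlds, goal_worlds):
    """First constraint: every capability-accessible world has a
    goal-accessible sub-world (sub-world approximated by subset)."""
    return all(any(g <= c for g in goal_worlds) for c in cap_worlds)


def c3_holds(bel_worlds, cap_from, cap_worlds):
    """Second constraint: any world that is capability-accessible from a
    belief-accessible world is capability-accessible from the current world.

    cap_from maps each belief-accessible world to the set of worlds that
    are capability-accessible from it.
    """
    return all(w2 in cap_worlds for w1 in bel_worlds for w2 in cap_from[w1])


# Hypothetical finite model for the first constraint.
CAP = {frozenset({"a", "b"}), frozenset({"a", "c"})}
GOALS_OK = {frozenset({"a"})}
GOALS_BAD = {frozenset({"d"})}
```

With `GOALS_OK` every capability world contains the goal world {a}, so the first constraint holds; with `GOALS_BAD` no capability world contains {d}, so it fails.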

4 Integrating Results 

The BDI logic and the semantic conditions stated in the previous section show that the
compositional behaviour of actions has not been dealt with in the BDI architecture. With
the recent addition of the Capability construct we believe that it is worthwhile exploring
this concept. Whereas the BDI framework is concerned with finding out what it means
exactly to have the ability to perform some action, we focus on the compositional
behaviour of actions. In other words, we are concerned with relating
the capability to perform a composite action to the capability for the
components of that action. Not all actions are treated equally in our approach; instead
the result of each action is determined individually and then the conclusion is made
whether the agent succeeds in performing that action. Three types of actions are dealt
with: (φ1; φ2) (φ1 followed by φ2), (while φ do ψ) (ψ as long as φ holds) and (if φ then
φ1 else φ2) (φ1 if φ holds and φ2 otherwise). The composite action (φ1; φ2) is discussed
in detail. An additional operator RES (result) is introduced to show the success/failure
of the component actions. The RES operator functions as a practition operator which
indicates the sequence of actions being performed, i.e., which action is performed next.
The existing BDI architecture doesn't mention anything about the actual execution of
actions. Since the transition caused by the execution of the action (φ1; φ2) equals the
sum of the transition caused by φ1 and the one caused by φ2 in the state brought about
by execution of φ1, the RES operator acts as a filter which checks whether
the first action results in a counterfactual state or not. Such filtering helps in avoiding
further deliberation at the time of action, as would otherwise be needed in situations arising
from counterfactual states. For example, the success of the printer command (lpr), in a
Unix environment, depends on the result of the execution of the command in the spooler
phase followed by the recognition of the command by the printer in the communication
phase. Here the action needs to be broken down into components and the success
of each action should be validated for the overall success. In such circumstances the
RES operator helps in providing the necessary specification. This goes in alliance with
our view of categorising composite actions under future intentions, where the scope of
practical reasoning is greater.
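The filtering role of the result in a sequential composition can be sketched operationally. The code below is not the paper's semantics; it is a toy execution model in which an action maps a state to a new state or to a failure marker standing for ⊥, and the second action runs only in the state produced by a successful first action. The names `FALSUM`, `run_sequence`, `spool` and `communicate`, and the dictionary keys, are hypothetical stand-ins for the two phases of the lpr example.

```python
FALSUM = None  # stands for the counterfactual result (falsity)


def run_sequence(first, second, state):
    """Execute first; only if its result is not FALSUM, execute second
    in the state brought about by first."""
    intermediate = first(state)
    if intermediate is FALSUM:
        return FALSUM
    return second(intermediate)


def spool(job):
    """Spooler phase: succeeds only if there is a file to print."""
    if not job.get("file"):
        return FALSUM
    return {**job, "spooled": True}


def communicate(job):
    """Communication phase: succeeds only if the printer is reachable."""
    if not job.get("printer_online"):
        return FALSUM
    return {**job, "printed": True}


ok = run_sequence(spool, communicate, {"file": "a.txt", "printer_online": True})
no_file = run_sequence(spool, communicate, {"printer_online": True})
offline = run_sequence(spool, communicate, {"file": "a.txt", "printer_online": False})
```

Only the first job succeeds; the second fails in the spooler phase, so the communication phase is never attempted, and the third fails in the communication phase.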

Definition 4 Let φ1, φ2 be actions, then the axioms for the operator RES are:

R1 CAP(does(φ1; φ2)) ⇒ ⋀_{i=1,2} BEL(does(φi)) ∧ BEL(RES(does(φi))) ≠ ⊥

R2 GOAL(does(φ1; φ2)) ⇒ ⋀_{i=1,2} CAP(does(φi)) ∧ RES(does(φi)) ≠ ⊥

R3 CAP(does(φ1; φ2)) ⇒ ⋀_{i=1,2} BEL(CAP(does(φi))) ∧ RES(does(φi)) ≠ ⊥

R4 GOAL(does(φ1; φ2)) ⇒ ⋀_{i=1,2} CAP(GOAL(does(φi))) ∧ RES(does(φi)) ≠ ⊥

R5 INT(does(φ1; φ2)) ⇒ ⋀_{i=1,2} CAP(INT(does(φi))) ∧ RES(does(φi)) ≠ ⊥

The first axiom states that if an agent has the capability of performing a composite action
φ1; φ2 then at some point of time the agent believes in doing φ1 and φ2 and believes
that the performance of φ1 does not end in a counterfactual state of affairs (i.e., it does
not end in falsity). Similarly the third axiom states that if an agent has the capability of
performing a composite action φ1; φ2, then at some point of time, the agent believes
that it has the capability of doing φ1 and believes in the capability of doing φ2 and the
result of φ1 does not end up in a counterfactual state of affairs.

The semantic conditions for RES are similar to those given in Definition 3. For
instance it can be shown that the semantic condition for R2 is

∀w' ∈ C_t^w(x), ∃w'' ∈ G_t^w(x), ∃w''' ∈ R_t^w(x) such that w'' ⊑ w' and w''' ⊑ w'

where R_t^w(x) is the set of result-accessible worlds of agent x in world w at time t.

This constraint means that for every capability-accessible world w' at time-point t,
there is a goal-accessible world w'' at that time-point which is a sub-world of w' and a
result-accessible world w''' which is a sub-world of w'. The converse doesn't hold as
there can be goal-accessible worlds that do not have corresponding capability-accessible
worlds, as well as result-accessible worlds that do not have corresponding capability but only have the
opportunity. We shall deal with the opportunity construct in the next section.

The action constructors while φ do ψ (which means ψ as long
as φ holds) and if φ then φ1 else φ2 (φ1 if φ holds and φ2 otherwise) are crucial from
a computational point of view. For an agent to be able to perform an action while φ do ψ it
has to have the ability to perform some finite number of actions constituting the body of the
while-loop as well as the opportunity to perform all the steps. Agents should not be able to
perform an action that goes on indefinitely. These specifications are formally represented
by the following two axioms.

R6 CAP(while φ do ψ) ⇒ [¬φ ∨ (φ ∧ BEL(CAP(does(ψ)))) ∧ RES(done(ψ)) ≠ ⊥]

R7 CAP(if φ then φ1 else φ2) ⇒
   [φ ∧ BEL(CAP(does(φ1))) ∧ RES(done(φ1)) ≠ ⊥] ∨
   [¬φ ∧ BEL(CAP(does(φ2))) ∧ RES(done(φ2)) ≠ ⊥].

The first proposition states that an agent is capable of performing an action while φ do
ψ, as long as φ holds and the agent believes that it has the capability of ψ and the result
of ψ does not end in falsity. Similarly R7 can be read as: an agent has the capability of
performing an action if φ then φ1 else φ2, if φ holds and the agent believes that it has
the capability of φ1 and the result of φ1 is true, or it is the case that φ does not hold
and the agent believes that it has the capability of φ2 and the result of φ2 does not end in a
counterfactual state of affairs.
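The two requirements on while-loops stated above, namely that each execution of the body must not end in falsity and that the loop must not run indefinitely, can be pictured with a bounded interpreter. This is a sketch under our own assumptions: the bound `max_steps` is an implementation device that stands in for the finiteness requirement, and `None` stands for ⊥; neither is part of the axioms.

```python
def run_while(condition, body, state, max_steps=1000):
    """Run body while condition holds.

    Returns the final state, or None (standing for falsity) if the body
    fails or the loop does not terminate within max_steps iterations.
    """
    steps = 0
    while condition(state):
        if steps >= max_steps:
            return None  # treated as a non-terminating, hence disallowed, action
        state = body(state)
        if state is None:
            return None  # the body ended in a counterfactual state
        steps += 1
    return state


counted = run_while(lambda s: s < 3, lambda s: s + 1, 0)
diverging = run_while(lambda s: True, lambda s: s, 0, max_steps=5)
failing = run_while(lambda s: s < 3, lambda s: None, 0)
```

The first loop terminates after three iterations; the second is cut off by the bound; the third stops as soon as its body fails.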



5 Integrating Opportunity 

Though in many cases it seems reasonable to assume that Capability implies Opportunity,
when it comes to practical reasoning Opportunity seems to play a significant role.
Van Linder [10] explains opportunity in terms of the correctness of action. An action
is correct for some agent to bring about some proposition iff (if and only if) the agent
has the opportunity to perform the action in such a way that its performance results in
the proposition being true. Integrating opportunity places a further constraint on
the agent to think about an action before getting committed. Consider the example
of a lion in a cage, which is perfectly well capable of eating a zebra, but ideally never
has the opportunity to do so.* Using the BDI formalism we would have to conclude
that the lion is capable of performing the sequential composition eat zebra ; fly to the
moon, which hardly seems to be intuitive. In such situations it is very important to know
the combination of Capability and Opportunity so that no unwarranted conclusions can
be drawn. We introduce an operator OPP whose intuitive meaning is that agent x has the
opportunity. The axioms for the OPP operator together with the Capability construct
can be given as follows.

Definition 5 Let φ1, φ2 be actions, then we have

O1 CAP(does(φ1; φ2)) ⇒ ⋀_{i=1,2} BEL(OPP(does(φi)))

O2 GOAL(does(φ1; φ2)) ⇒ ⋀_{i=1,2} CAP(does(φi)) ∧ OPP(does(φi))

O3 CAP(does(φ1; φ2)) ⇒ ⋀_{i=1,2} BEL(CAP(does(φi))) ∧ OPP(does(φi))

O4 GOAL(does(φ1; φ2)) ⇒ ⋀_{i=1,2} CAP(GOAL(does(φi))) ∧ OPP(does(φi))

O5 INT(does(φ1; φ2)) ⇒ ⋀_{i=1,2} CAP(INT(does(φi))) ∧ OPP(does(φi))

O6 CAP(while φ do ψ) ⇒ [¬φ ∨ (φ ∧ BEL(CAP(does(ψ)))) ∧ OPP(does(ψ))]

O7 CAP(if φ then φ1 else φ2) ⇒ [φ ∧ BEL(CAP(does(φ1))) ∧ OPP(does(φ1))] ∨
   [¬φ ∧ BEL(CAP(does(φ2))) ∧ OPP(does(φ2))]

* The example is taken from [10].

The third axiom states that if an agent has the capability of performing φ1; φ2 then the
agent believes that he has the capability of φ1, if he has the opportunity of φ1, and he
has the capability of φ2, if he has the opportunity of φ2. Similarly O7 can be interpreted
as: if an agent has the capability of doing the action (if φ then φ1 else φ2) then either φ
holds and the agent believes that he/she has the capability of φ1 provided the
opportunity exists, or ¬φ holds and the agent has the capability of φ2 provided the opportunity
exists. The other axioms can be interpreted in a similar manner.
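The caged-lion example can be restated as a small check that requires both capability and opportunity for every component of a sequence. This is only an illustrative sketch: representing an agent as two sets of action names, and the helper names `can_perform` and `can_perform_sequence`, are our assumptions rather than anything defined by the axioms.

```python
def can_perform(agent, action):
    """An action is performable only with both capability and opportunity."""
    return action in agent["capabilities"] and action in agent["opportunities"]


def can_perform_sequence(agent, actions):
    """A sequential composition needs every component to be performable."""
    return all(can_perform(agent, a) for a in actions)


# The caged lion: capable of eating a zebra, but without the opportunity.
caged_lion = {"capabilities": {"eat zebra"}, "opportunities": set()}

# The same lion on the savannah, where a zebra is within reach.
free_lion = {"capabilities": {"eat zebra"}, "opportunities": {"eat zebra"}}
```

Under this reading the caged lion cannot perform even the single action eat zebra, so the counterintuitive conclusion about eat zebra ; fly to the moon no longer follows; the free lion can eat the zebra but still cannot complete the sequence.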

6 Opportunity + Results 

In [10] a division is made between optimistic and pessimistic agents and the
interpretation of the OPP formulae is done accordingly. They make use of two dynamic operators
⟨do_i(α)⟩φ and [do_i(α)]φ. The first one denotes that an agent i has to have the
opportunity to perform the action α in such a way that φ will result from the performance
(Pessimistic Approach): a pessimistic agent needs certainty. The second one is the dual
of the first and states that if the opportunity to do α is present then φ would be among
the results of do_i(α) (Optimistic Approach). The formula [do_i(α)]φ is noncommittal
about the opportunity of the agent i to perform the action α. We do not go for such
a division and interpret the OPP formulae in a realistic manner linked with the RES
operator. Such a formalism helps in avoiding unwarranted results as were seen in the
earlier examples. In what follows we present the axioms capturing this intuition.

OR1 CAP(does(φ1; φ2)) ⇒
    [BEL(does(φ1)) ∧ OPP(does(φ1)) ∧ RES(done(φ1)) ≠ ⊥]
    ∧ [BEL(does(φ2)) ∧ OPP(does(φ2))]

OR2 GOAL(does(φ1; φ2)) ⇒
    [CAP(does(φ1)) ∧ OPP(does(φ1)) ∧ RES(done(φ1)) ≠ ⊥]
    ∧ [CAP(does(φ2)) ∧ OPP(does(φ2))]

OR3 CAP(does(φ1; φ2)) ⇒
    [BEL(CAP(does(φ1))) ∧ OPP(does(φ1)) ∧ RES(done(φ1)) ≠ ⊥]
    ∧ [BEL(CAP(does(φ2))) ∧ OPP(does(φ2))]

OR4 GOAL(does(φ1; φ2)) ⇒
    [CAP(GOAL(does(φ1))) ∧ OPP(does(φ1)) ∧ RES(done(φ1)) ≠ ⊥]
    ∧ [CAP(GOAL(does(φ2))) ∧ OPP(does(φ2))]

OR5 INT(does(φ1; φ2)) ⇒
    [CAP(INT(does(φ1))) ∧ OPP(does(φ1)) ∧ RES(done(φ1)) ≠ ⊥]
    ∧ [CAP(INT(does(φ2))) ∧ OPP(does(φ2))]

OR6 CAP(while φ do ψ) ⇒
    [(φ ∧ BEL(CAP(does(ψ))) ∧ OPP(does(ψ)) ∧ RES(done(ψ)) ≠ ⊥) ∨ ¬φ]

OR7 CAP(if φ then φ1 else φ2) ⇒
    [φ ∧ BEL(CAP(does(φ1))) ∧ OPP(does(φ1)) ∧ RES(done(φ1)) ≠ ⊥] ∨
    [¬φ ∧ BEL(CAP(does(φ2))) ∧ OPP(does(φ2)) ∧ RES(done(φ2)) ≠ ⊥]

Axioms OR1-OR7 are a formalisation of the results and opportunities together with the
capability operator for composite actions. OR3 states that if an agent has the capability of
doing a composite action (φ1; φ2) to achieve φ then the agent believes that it has the
capability, provided the right opportunity, in each of the atomic states and the resulting
condition is in alliance with its beliefs, i.e., it does not result in counterfactual situations.
The actual execution of actions is made explicit through such a formalisation. Similarly
OR6 states that if an agent has the Capability and Opportunity to perform a while-loop
then it keeps this opportunity under execution of the body of the loop as long as the
condition holds, i.e., as long as the result is true.

7 Commitment Axioms Revisited 

In [13] a division is made in the commitment strategies of an agent, categorising an
agent as a blindly committed agent, a single-minded agent, or an open-minded agent. A
blindly committed agent maintains her intentions until she actually believes that she
has achieved them; the single-minded agent maintains her intentions as long as she
believes that they are still options; finally an open-minded agent maintains her intentions
as long as the intentions are still her goals. Based on the semantics given in the previous
section the formalisation can be given as follows.

CA1 INT(inevitable◇φ) ⇒
    inevitable(INT(inevitable◇φ) U BEL(RES(φ)))

CA2 INT(inevitable◇φ) ⇒
    inevitable(INT(inevitable◇φ) U BEL(CAP(φ)) ∨ ¬BEL(OPP(optional◇φ)))

CA3 INT(inevitable◇φ) ⇒
    inevitable(INT(inevitable◇φ) U BEL(GOAL(φ)) ∨ ¬CAP(optional◇φ))

The self-aware agent mentioned in [11] can be added to the above set of commitment
strategies directly. It seems that the formalisation depicted above is much more intuitive
than the one given by Rao and Georgeff [13]. For instance the axiom of blind
commitment states that, if an agent intends that inevitably φ be eventually true, then the agent
will inevitably maintain its intentions until she believes in the result of φ. The addition
of result is important in the sense that the blindly committed agent maintains the
intentions until the agent actually believes that she has achieved them, i.e., until the agent
has a justified true belief. This condition is needed for an agent blindly committed to
her means to inevitably eventually believe that she has achieved her means or ends.
It also seems to be in alliance with the philosophical theories concerning the nature
of belief. Similarly a single-minded agent maintains her intentions as long as she
believes that she has got the capability for it. Since we do not say anything about an agent
optionally achieving particular means or ends, even if the opportunity is present, the
agent does not believe that optionally φ be eventually true, which is captured by
¬BEL(OPP(optional◇φ)). Finally, an open-minded agent maintains her intentions as
long as these intentions are still her goals or as long as she lacks the ability of optionally
achieving them.



8 Conclusion and Future Work 



The representation of, and reasoning about, composite actions in a BDI environment forms 
the primary contribution of this work. Our work is motivated by the fact that many 
BDI systems provide no clue as to the actual execution of actions, and are only able 
to perform actions in an endogenous manner. When dealing with composite actions the 
actual execution of actions needs to be represented and reasoned about for the overall 
success of the action. The addition of the two operators RES and OPP strengthens the 
semantics and functions as a filter in avoiding counterfactual situations. Though some 
mention has been made in [3] of the composite action construct $(\varphi_1; \varphi_2)$, it has been 
restricted to the Intention domain and nothing has been mentioned regarding the result 
of the actions. The only other comparable work is given by [10]. 

An explicit representation of temporal constructs can be seen as a further extension 
to this work. We have used the temporal operator as a static variable. When it comes to 
composite actions it is important to mention explicitly the time of each action and the 
temporal duration of the commitment an agent has towards each action. The interpre- 
tation of the O (next) operator in the original logic needs to be verified. For example 
when it comes to composite actions like {(f>; ip) the temporal operator Q can be inter- 
preted either as <>{(p OV') or {(p => Qltp)- The temporal notion as to whether the 
action is performed now or eventually needs to be clarified. It would also be worthwhile 
to investigate $\mathit{does}(\varphi; \psi)$ in terms of $(\mathit{done}\,\varphi;\ \mathit{does}\,\psi)$, i.e., to find whether $\mathit{does}(\varphi; \psi)$ 
is concurrent or sequential. 
A Possible World Semantics 

A structure M is a tuple $M = \langle W, \{S_w\}, \{R_w\}, B, D, I, L \rangle$ where $W$ is a set of 
possible worlds; $S_w$ is a set of time points in world $w$; $R_w \subseteq S_w \times S_w$ is a total binary 
temporal accessibility relation; $L$ is a truth assignment function that assigns to each 
atomic formula the set of world-time pairs at which it holds. $B$ is a belief-accessibility 
relation that maps a time-point in a world to a set of worlds that are belief accessible 
to it; and $D$ and $I$ are desire and intention accessibility relations, respectively, that are 
defined in the same way as $B$. 

There are two types of formulas: state formulas (which are evaluated at a state in a 
time-tree) and path formulas (which are evaluated against a path in a time-tree). They 
are defined as follows. 

- any propositional formula is a state formula; if $\varphi$ and $\psi$ are state formulas then so 
too are $\varphi \lor \psi$ and $\neg\varphi$; 

- if $\varphi$ is a state formula then so too are $\mathrm{BEL}(\varphi)$ and $\mathrm{INT}(\varphi)$; 

- if $\psi$ is a path formula then $\mathit{optional}(\psi)$ and $\mathit{inevitable}(\psi)$ are state formulas; 

- any state formula is also a path formula; 

- if ^ and are path formulas then so too are ^ and Dip 

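A full path in $w$ is an infinite sequence of time points $(w_{t_0}, w_{t_1}, \ldots)$ such that 
$(w_{t_i}, w_{t_{i+1}}) \in R_w$ for all $i$. Satisfaction of a state formula $\varphi$ is defined with respect to 
a structure $M$, a world $w$ and a time point $t$, denoted by $M, w_t \models \varphi$. 

Satisfaction of a path formula $\psi$ is defined with respect to a structure $M$, a world 
$w$, and a full path $(w_{t_0}, w_{t_1}, \ldots)$ in world $w$. 

- $M, w_t \models \varphi$ iff $(w, t) \in L(\varphi)$, where $\varphi$ is an atomic formula. 

- $M, w_t \models \neg\varphi$ iff $M, w_t \not\models \varphi$. 

- $M, w_t \models \varphi_1 \lor \varphi_2$ iff $M, w_t \models \varphi_1$ or $M, w_t \models \varphi_2$. 

- $M, (w_{t_0}, w_{t_1}, \ldots) \models \varphi$ iff $M, w_{t_0} \models \varphi$, where $\varphi$ is a state formula. 

- $M, (w_{t_0}, w_{t_1}, \ldots) \models \bigcirc\varphi$ iff $M, (w_{t_1}, \ldots) \models \varphi$. 

- $M, (w_{t_0}, w_{t_1}, \ldots) \models \Diamond\psi$ iff $\exists t_k \in (w_{t_0}, w_{t_1}, \ldots)$ s.t. $M, (w_{t_k}, w_{t_{k+1}}, \ldots) \models \psi$. 

- $M, w_{t_0} \models \Box\psi$ iff $M, (w_{t_0}, w_{t_1}, \ldots) \models \psi$ for all full paths $(w_{t_0}, w_{t_1}, \ldots)$. 

- $M, (w_{t_0}, w_{t_1}, \ldots) \models \psi_1 \,\mathrm{U}\, \psi_2$ iff for some $i \geq 0$, $M, w_{t_i} \models \psi_2$ and for all $0 \leq j < i$, 
$M, w_{t_j} \models \psi_1$. 

- $M, w_{t_0} \models \mathit{optional}(\psi)$ iff there exists a full path $(w_{t_0}, w_{t_1}, \ldots)$ such that 
$M, (w_{t_0}, w_{t_1}, \ldots) \models \psi$. 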
Abstract. Wrappers have recently been used to obtain parameter 
optimizations for learning algorithms. In this paper we investigate the use 
of a wrapper for estimating the correct number of boosting ensembles in 
the presence of class noise. Contrary to the naive approach that would 
be quadratic in the number of boosting iterations, the incremental algo- 
rithm described is linear. 

Additionally, directly using the k-sized ensembles generated during k-fold 
cross-validation search for prediction usually results in further improve- 
ments in classification performance. This improvement can be attributed 
to the reduction of variance due to averaging k ensembles instead of us- 
ing only one ensemble. Consequently, cross-validation in the way we use 
it here, termed wrapping, can be viewed as yet another ensemble learner 
similar in spirit to bagging but also somewhat related to stacking. 

Keywords: machine learning. 



1 Introduction 

Boosting can be viewed as an induction method that sequentially generates a 
set of classifiers by reweighting the training set in accordance with the perfor- 
mance of each intermediate set of classifiers. Theoretical attempts at explaining 
boosting’s superior performance, based on so-called margins [17], would imply 
the following relationship between predictive performance of an ensemble and its 
size: given sufficiently expressive base classifiers, in the limit (i.e. the ensemble 
consists of infinitely many classifiers) each training example will have a margin 
of 1. This infinite ensemble will also be optimal in terms of predictive error on 
new test examples. 

Obviously, one would expect this relationship to hold only in noise-free cases, 
and quite a few recent studies (e.g. [3]) have shown boosting’s potential for 
over-fitting noisy data. Consequently, quite a few authors have proposed and 
investigated various modifications of the original AdaBoostM1 algorithm [5]. 

Some of these attempts focus on the reweighting policy, which in the original 
algorithm utilises an exponential function. Modified reweighting policies try to 
be less aggressive [4,6]. Usually the modified algorithm includes an additional 
parameter for regularization, which could, for example, be an estimate of the 
optimal ensemble size, or the maximal percentage of training examples, that 
the ensemble is allowed to misclassify. Also, Friedman’s additive regression 
interpretation of boosting [8], which adds a shrinkage parameter, can be seen as a 
counter-measure to exponentially fast fitting of the training data. Alternatively, 
others have attempted to counter noise problems by using some kind of bagging 
around boosting. Whereas [12] directly bags boosted ensembles, MultiBoost 
[19] utilizes a bagging variant called wagging, which simulates bagging’s 
sample-with-replacement by Poisson-distributed weights. 

What all these methods have in common is their indirect approach of solving 
the anticipated problem: all force the user to specify additional parameters. Most 
of these parameters have an obvious interpretation, so we can expect the user 
to supply reasonable values. Take the case of BrownBoost as an example: the 
user is supposed to supply the true noise-level c as a parameter, thus allowing the 
algorithm to classify c% of the training examples incorrectly. Thus, we have only 
shifted the burden of selecting a reasonable ensemble size upfront to estimating 
another parameter. Still, there is no guarantee that the supplied estimates are 
effective for a given dataset. 

Alternatively, in this paper we investigate a more direct approach: we try 
to estimate the appropriate size of the boosted ensemble directly by standard 
cross-validation. We will show how cross-validation can be computed efficiently 
for boosters, and we will also show that it naturally leads to ensembles of boosters 
at no extra induction time cost. The next section defines the algorithm. In section 
3 we report on experiments involving various boosters and various levels of noise 
in datasets. Section 4 discusses our findings and in section 5 we draw our final 
conclusions. 



2 Wrapping 

Usually, boosting seems to be pretty stable even in the presence of noise: if 
enough boosting iterations are performed, a boosted ensemble outperforms an 
unboosted base-level learner most of the time, sometimes by an impressive 
amount. Even in the presence of noise the behaviour does not seem to dete- 
riorate too much. But judging by the results cited above, boosting could have 
performed better in cases with noise. Recently, it has been shown that if the op- 
timal Bayes error for some dataset is different from zero, the boosted ensemble 
will not be optimal in the limit, but some initial prefix of the same ensemble 
will [9]. 

We try to directly estimate this optimal size of the boosting ensemble by 
simple cross-validation. This is reminiscent of so-called early stopping in neu- 
ral network induction (see e.g. [16]), where some portion of the training set is 
set aside and used as an independent evaluation set for judging whether perfor- 
mance is still improving or not. Standard k-fold cross-validation seems to be a 
more principled estimator, but of course involves a k-fold higher runtime cost. 
Trying to optimize parameters by cross-validation is not new, either. Most im- 
portantly, it has been formalized and called the wrapper approach for feature 
subset selection [10]. Also, in [7] cross-validation is used to determine the optimal 
size of an ADTree for a given dataset. 
Table 1. Pseudo-code for two ways of estimating the optimal ensemble size: standard 
cross-validation, which is O(k * N^2), and wrapping (incremental cross-validation), 
which is O(k * N). 

Standard CV 

    func estimateSize(k, data, booster, N) 
      let bestSize = 0 
      let minError = 1.0 
      for T from 1 upto N 
        error = cvEstimate(k, data, booster, T) 
        if (error <= minError) 
          minError = error 
          bestSize = T 
        endif 
      endfor 
      return bestSize 

Wrapping (incremental CV) 

    func Wrap(k, data, booster, N) 
      let boosters = new booster[k] 
      let bestSize = 0 
      let minError = 1.0 
      for i from 1 upto k 
        boosters[i].initCV(data, i, k) 
      endfor 
      for T from 1 upto N 
        let error = 0.0 
        for i from 1 upto k 
          boosters[i].iterate() 
          error += boosters[i].estimate() 
        endfor 
        error = error/k 
        if (error < minError) 
          minError = error 
          bestSize = T 
        endif 
      endfor 
      return bestSize 


Simple-minded application of cross-validation is hampered by excessive 
runtime needs. The issue here is not the multiplication due to k folds being used, 
but the fact that using simple-minded cross-validation for determining the right 
ensemble size shows quadratic behaviour in the number of calls to the underlying 
base-level learner: one call for an ensemble of size one, two calls for an ensemble 
of size two, and so on, yielding a total of k * n * (n + 1)/2 calls for estimating all 
ensemble sizes from 1 up to n using k-fold cross-validation. This will clearly be 
a prohibitive cost for most datasets. 

Luckily, there is a simple remedy available, due to the additive nature of 
boosting ensembles: for a given set of examples, the ensemble of size m consists 
of the ensemble of size m - 1 plus one more base-level classifier. Both 
ensembles use exactly the same m - 1 classifiers. So the smart way to implement 
cross-validation is to do it incrementally, simply adding one base-level classifier 
after the other, interleaving these steps with performance estimation on the 
respective test-fold. Thus we can reduce the complexity of our estimation down 
Table 2. Pseudo-code for the ensemble-returning variant of wrapping. 

Wensemble 

    func WrapEnsemble(k, data, booster, N) 
      let minError = 1.0 
      let boosters = new booster[k] 
      for i from 1 upto k 
        boosters[i].initCV(data, i, k) 
      endfor 
      let bestBoosters = boosters.clone()     // <== CHANGED 
      for T from 1 upto N 
        let error = 0.0 
        for i from 1 upto k 
          boosters[i].iterate() 
          error += boosters[i].estimate() 
        endfor 
        error = error/k 
        if (error < minError) 
          minError = error 
          bestBoosters = boosters.clone()     // <== CHANGED 
        endif 
      endfor 
      return bestBoosters                     // <== CHANGED 



to k * n calls of the base-level learner, i.e. a linear number of calls. Therefore this 
cost is identical to the cost of just one k-fold cross-validation for the maximal 
ensemble size n, meaning that we get all the other estimates for sizes 1, 2, up to 
n - 1 at no additional cost. Also, bagging k times a boosted ensemble of size n 
would involve exactly the same cost. 
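
The difference between the two call counts can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function names `naive_calls` and `incremental_calls` are hypothetical and simply count base-learner invocations under the description given above (one call for size one, two for size two, and so on, per fold). 

```python
def naive_calls(k: int, n: int) -> int:
    """Base-learner calls when every ensemble size 1..n is rebuilt from scratch in each of k folds."""
    return k * sum(range(1, n + 1))


def incremental_calls(k: int, n: int) -> int:
    """Base-learner calls when each fold's ensemble is grown by one classifier per step."""
    return k * n


if __name__ == "__main__":
    # Ten folds, maximal ensemble sizes as used for AdaBoost (10) and ADTree (100).
    for n in (10, 100):
        print(n, naive_calls(10, n), incremental_calls(10, n))
```

For ten folds and a maximal size of 100, the naive scheme needs 50500 calls where the incremental one needs 1000. 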

We will call this improved implementation of cross-validation wrapping. Ba- 
sically, we are estimating a single integer parameter in the range from 1 to N, 
where N is user-specified. This improved version is applicable whenever the al- 
gorithm in question is incremental in that parameter, which is obviously true for 
boosted ensembles due to their additive nature. 

Pseudo-code comparing standard cross-validation to its incremental variant 
termed wrapping is given in Table 1. We assume that incremental boosters 
implement an interface that supports at least the following operations (of course, 
in practice the interface may also include additional functions for bookkeeping, 
cleanup, general output, and so on): 

— initCV(data, i, k): which initializes the data-structures of the respective 
boosting algorithm, as well as separates the training data into the ith-fold 
for later testing and the remaining data as the ith-fold training set (it is 
sufficient to just keep the index and a local boosting weight for each training 
example; full copies are not required here). 

— iterate(): which performs one boosting iteration, e.g. adding the next best 
test to an ADTree, or adding another C4.5 generated tree to an AdaBoosted 
C4.5 ensemble. Each iteration will only use its ith-fold training data subset 
for induction. 

— estimate(): which returns an estimate of the predictive error rate of the 
ensemble at the current size, using the previously set-aside ith-fold test set. 
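
As an illustration of how these three operations fit together, here is a minimal Python sketch of the wrapping loop. It is not the WEKA implementation: `ScriptedBooster` is a hypothetical stand-in that replays a fixed per-fold error curve instead of learning anything, and the snake_case method names are assumptions chosen to mirror initCV, iterate and estimate. 

```python
from typing import Callable, Protocol, Sequence


class IncrementalBooster(Protocol):
    """Interface assumed by the wrapper (mirrors initCV / iterate / estimate)."""

    def init_cv(self, data: Sequence, fold: int, k: int) -> None: ...
    def iterate(self) -> None: ...
    def estimate(self) -> float: ...


class ScriptedBooster:
    """Hypothetical booster: returns a pre-recorded test error for each ensemble size."""

    def __init__(self, curve: Sequence[float]) -> None:
        self.curve = list(curve)
        self.size = 0

    def init_cv(self, data: Sequence, fold: int, k: int) -> None:
        # A real booster would set aside fold `fold` of `k` for testing here.
        self.size = 0

    def iterate(self) -> None:
        # A real booster would add one base-level classifier here.
        self.size += 1

    def estimate(self) -> float:
        return self.curve[self.size - 1]


def wrap(k: int, data: Sequence,
         make_booster: Callable[[int], IncrementalBooster], max_size: int) -> int:
    """Return the ensemble size with the lowest fold-averaged error estimate."""
    boosters = [make_booster(i) for i in range(k)]
    for i, booster in enumerate(boosters):
        booster.init_cv(data, i, k)
    best_size, min_error = 0, 1.0
    for size in range(1, max_size + 1):
        error = 0.0
        for booster in boosters:  # one incremental step per fold
            booster.iterate()
            error += booster.estimate()
        error /= k
        if error < min_error:
            min_error, best_size = error, size
    return best_size


if __name__ == "__main__":
    curves = [[0.30, 0.20, 0.25, 0.35], [0.40, 0.30, 0.20, 0.30]]
    print(wrap(2, [], lambda i: ScriptedBooster(curves[i]), 4))
```

Each booster is advanced exactly once per candidate size, so the loop performs k * max_size boosting iterations in total. 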

Additionally, we can further improve the utility of wrapping in the following 
way: with the above scheme we first estimate a good size, and then induce 
one ensemble of exactly that size using all of the training data. Alternatively, 
quite similar to bagging [2], we can also just directly use these k ensembles 
of optimal size m as computed during cross-validation, yielding a k * m size 
ensemble reminiscent of a bagged boosting ensemble. In that case we don’t even 
have to perform the final induction step over all the training data. This variant 
is depicted in Table 2. As an alternative implementation, this variant might 
extract the best-sized ensemble for each fold separately, i.e. estimate the best 
size for each fold independently from all other folds. Thus the ensembles of each 
fold might vary in size. We have experimented with this alternative variant as 
well, but found that the estimates computed in such a manner were a lot more 
unstable, consequently causing inferior predictive behaviour. 
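
The text does not spell out how the k fold-ensembles are combined at prediction time. The following sketch assumes a plain unweighted majority vote over two-class predictions in {-1, +1}, in the spirit of bagging; the name `vote` and the stub classifiers are hypothetical. 

```python
from typing import Callable, Sequence

# A fold ensemble is modelled as a function from an example to a label in {-1, +1}.
Classifier = Callable[[Sequence[float]], int]


def vote(ensembles: Sequence[Classifier], x: Sequence[float]) -> int:
    """Unweighted majority vote of the k fold ensembles (assumed combination rule)."""
    total = sum(ensemble(x) for ensemble in ensembles)
    return 1 if total >= 0 else -1


if __name__ == "__main__":
    # Three stub "ensembles" thresholding the first attribute at different points.
    stubs = [lambda x, t=t: 1 if x[0] > t else -1 for t in (0.2, 0.5, 0.8)]
    print(vote(stubs, [0.6]), vote(stubs, [0.1]))
```

With an odd k the vote never ties; for even k the sketch arbitrarily resolves ties in favour of the positive class. 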

So, in summary, wrapping (using k-fold cross-validation) allows us to both 
estimate the best size m for a boosted ensemble as well as compute a k-sized 
ensemble of such m-sized boosted ensembles, all in one go. Best of all, the total 
runtime cost of wrapping is about the same as that of k-times bagging the booster 
to the respective maximal size n. 

In the next section we will investigate the utility of wrapping in terms of 
predictive error rates. 

3 Experiments 

This section compares the performance, in terms of predictive error rates, of 
three different boosting algorithms and their bagged and wrapped versions 
respectively. Runtime increased by an order of magnitude: averaged factors, 
normalized against AdaBoostM1, are e.g. 8.43 for Bagging, 8.01 for Wrapping, 
and 9.04 for Wensemble. 

The datasets used and their properties are listed in Table 3. All sixteen 
datasets are taken from the UCI repository [1]. The datasets were evaluated 
using five times ten-fold cross-validation. All datasets are two-class problems 
only, as some of the boosters we use are currently limited to such problems. 
The noisy variants of these datasets were generated as follows: as we only deal 
with class-noise here, X% of the examples were chosen at random and their 
class value was flipped. This noisification was done in a preprocessing step prior 
to experimentation, i.e. both training and testing were done on noisy data. This 
specific noise model (which has been used previously, e.g. in [14]) was chosen 
because of both its simplicity and its guaranteed noise levels: if X% is specified, 
exactly X% of all examples will be given a new, different class label. 
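
A minimal sketch of this noise model for two-class data with labels 0 and 1 follows. The function name `flip_labels`, the seeding, and the rounding of X% to a whole number of examples are assumptions made for illustration. 

```python
import random
from typing import List, Sequence


def flip_labels(labels: Sequence[int], percent: float, seed: int = 0) -> List[int]:
    """Flip the class of a randomly chosen `percent`% of binary (0/1) labels."""
    rng = random.Random(seed)
    n_flip = round(len(labels) * percent / 100)  # rounding is an assumption
    chosen = rng.sample(range(len(labels)), n_flip)  # distinct indices, so no double flips
    noisy = list(labels)
    for i in chosen:
        noisy[i] = 1 - noisy[i]
    return noisy


if __name__ == "__main__":
    clean = [0, 1] * 50
    noisy = flip_labels(clean, 30)
    print(sum(a != b for a, b in zip(clean, noisy)))
```

Because the indices are sampled without replacement, the number of changed labels is fixed by X rather than varying from run to run, which is the "guaranteed noise level" property described above. 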



Table 3. Datasets used for the experiments 

Dataset          Instances  Missing values (%)  Numeric attributes  Nominal attributes 
breast-cancer          286                 0.3                   0                   9 
breast-wisc            699                 0.2                   9                   0 
cleveland              303                 0.2                   6                   7 
credit                 690                 0.6                   6                   9 
diabetes               768                 0.0                   8                   0 
hepatitis              155                 5.4                   6                  13 
hypothyroid           3772                 5.4                   7                  22 
ionosphere             351                 0.0                  34                   0 
kr-vs-kp              3196                 0.0                   0                  36 
labor                   57                33.6                   8                   8 
promoters              106                 0.0                   0                  57 
sick-euthyroid        3163                 6.5                   7                  18 
sonar                  208                 0.0                  60                   0 
splice                3190                 0.0                   0                  61 
vote                   435                 5.3                   0                  16 
vote1                  435                 5.5                   0                  15 


To investigate the sensitivity of bagging and wrapping with respect to the 
underlying booster, we have conducted experiments with three different boosters: 

— ADTree (our WEKA version of it [13]) using randomized search and a max- 
imal ensemble size of 100. 

— AdaBoostM1 over C4.5 (in their respective WEKA incarnations) with a 
maximal ensemble size of 10. 

— An ADTree variant that triples the size of the tree at each iteration, with a 
maximal ensemble size of 8. 

Unfortunately, it is somewhat tricky to depict all the variations along all 
axes available for comparison: noise-levels, algorithms, datasets. Therefore we 
will only give exemplary full tables of results for one base algorithm, namely 
for ADTree induction, for noise-free data and data with 30% class noise (the 
extreme cases), and only summary tables for everything else. So, Tables 4 and 5 
depict predictive error rates (a difference between two entries is considered 
significant if it is statistically significant at the 1% level according 
to a paired two-sided t-test) for the following four versions of ADTree induction: 
Table 4. Predictive error, no noise. The best entry in each line is set in boldface, 
a prefix star marks values that are significantly different from the value in the first 
column. 

Dataset          ADTree   Bagging   Wrapping   Wensemble 
BREAST-CANCER     31.59   * 28.57    * 26.02     * 25.32 
BREAST-W           3.83    * 3.49     * 4.32        3.63 
CLEVE             21.78   * 17.15    * 17.28     * 16.03 
CREDIT-A          15.10   * 13.22      15.68       15.07 
CREDIT-G          25.50   * 23.58      26.62       24.60 
DIABETES          26.22   * 24.55      26.45     * 24.58 
HEART-STATLOG     20.30   * 18.00    * 17.11     * 17.33 
HEPATITIS         18.45   * 17.27      17.53       18.09 
IONOSPHERE         8.25    * 7.46       8.43      * 7.40 
KR-VS-KP           0.86      0.79       0.84      * 0.73 
LABOR             12.33     10.47      13.27       12.33 
PROMOTERS          6.76      7.71     * 9.22        5.84 
SICK               1.13    * 1.44     * 1.37        1.11 
SONAR             13.66     14.50    * 16.64       12.86 
VOTE               4.05      3.96       4.32        4.28 
VOTE1              9.33    * 8.60       9.43      * 8.43 


— ADTree: using randomized search boosted 100 times. In ADTrees one boost- 
ing iteration adds exactly one test to the current tree. 

— Bagged ADTrees: generate an ensemble of 10 ADTrees, each of size 100, by 
means of bagging. 

— Wrapped ADTree: use wrapping to determine the optimal ADTree size of up 
to 100 tests, then induce a single ADTree of that size using the full training 
set. 

— Wrapped ADTree ensemble: use wrapping to determine the optimal ADTree 
size of up to 100 tests, but instead of subsequently inducing another tree, 
simply use the ensemble of the 10 trees of optimal size generated during 
cross-validation as the final ensemble. 

Table 6 depicts the number of significant wins, draws, and losses over all 
noise levels and base-learners in a pair-wise manner for the two pairs we think 
are the most reasonable pairwise competitors: the sole booster versus its simple 
wrapped form, as well as the bagged booster versus the wrapped ensemble. 
Finally, Table 7 depicts the number of significant wins, draws, and losses for all 
pairwise combinations. 
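
To make the "wins-draws-losses" notation concrete, the sketch below recomputes one summary entry from the ADTree and Wrapping columns of Table 4 (no noise). The error values and star flags are copied from that table; the function name `tally` and the tuple layout are hypothetical. A starred entry counts as a significant difference, and the lower error wins. 

```python
from typing import List, Tuple

# (dataset, ADTree error, Wrapping error, Wrapping entry starred in Table 4)
ROWS: List[Tuple[str, float, float, bool]] = [
    ("breast-cancer", 31.59, 26.02, True),
    ("breast-w", 3.83, 4.32, True),
    ("cleve", 21.78, 17.28, True),
    ("credit-a", 15.10, 15.68, False),
    ("credit-g", 25.50, 26.62, False),
    ("diabetes", 26.22, 26.45, False),
    ("heart-statlog", 20.30, 17.11, True),
    ("hepatitis", 18.45, 17.53, False),
    ("ionosphere", 8.25, 8.43, False),
    ("kr-vs-kp", 0.86, 0.84, False),
    ("labor", 12.33, 13.27, False),
    ("promoters", 6.76, 9.22, True),
    ("sick", 1.13, 1.37, True),
    ("sonar", 13.66, 16.64, True),
    ("vote", 4.05, 4.32, False),
    ("vote1", 9.33, 9.43, False),
]


def tally(rows: List[Tuple[str, float, float, bool]]) -> str:
    """Count significant wins, draws and significant losses of the first algorithm."""
    wins = draws = losses = 0
    for _name, first, second, significant in rows:
        if not significant:
            draws += 1
        elif first < second:  # lower predictive error is better
            wins += 1
        else:
            losses += 1
    return f"{wins}-{draws}-{losses}"


if __name__ == "__main__":
    print(tally(ROWS))
```

The result, 4-9-3, agrees with the entry reported for ADTree vs. Wrapping at 0% noise. 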

We can summarize all the figures of these tables into the following qualitative 
findings: 

— Wrapping seems to be able to choose a reasonable size for the underlying 
boosting algorithm, as it rarely performs significantly worse than the booster 
itself. 
Table 5. Predictive error, 30% noise. See Table 4 for more explanation. 

Dataset          ADTree   Bagging   Wrapping   Wensemble 
BREAST-CANCER     48.82   * 45.83    * 44.34     * 44.69 
BREAST-W          36.71   * 35.34    * 34.43     * 33.71 
CLEVE             43.87     42.08      44.04     * 41.37 
CREDIT-A          40.03   * 37.91    * 37.42     * 36.20 
CREDIT-G          44.78     43.72    * 43.40     * 43.40 
DIABETES          45.65     44.61      45.42       45.03 
HEART-STATLOG     47.85   * 45.04    * 41.56     * 40.96 
HEPATITIS         36.80   * 34.12      35.13     * 33.24 
IONOSPHERE        44.45   * 40.97    * 41.31     * 39.84 
KR-VS-KP          33.63   * 32.44    * 31.97     * 31.91 
LABOR             50.73     49.60      52.20       51.53 
PROMOTERS         52.44     51.36      51.45       52.60 
SICK              33.92   * 32.51    * 31.36     * 31.14 
SONAR             37.89     38.66    * 40.26       37.56 
VOTE              38.20   * 35.77    * 36.28     * 34.75 
VOTE1             36.68   * 35.13    * 30.61     * 30.43 



— Bagging also improves performance over just boosting most of the time, and 
it seems to perform better than simple wrapping. 

— Wrapped ensemble performs as well as Bagging at low noise levels, and even 
better at higher noise levels. 

Interestingly, there is also the odd dataset where boosting simply outperforms 
every method, even in the presence of noise. We suspect that this behaviour will 
occur mostly in situations where the available training set is actually too small. 
As has been shown in [18], usual learning curves are pretty steep initially 
up to a point where they finally level out asymptotically to the best value a 
specific algorithm can achieve for a particular dataset. Now if the size of the 
given training set lies within this first steep region of the learning curve, a few 
additional examples can make a big difference. So bagging, which on average only 
includes about two-thirds of the training set in each bag, may be disadvantaged. 
Similarly, ten-fold cross-validation only uses 90% of the training set for inducing a 
classifier for each fold, so it too may be disadvantaged, but to a lesser degree. Still, 
cross-validation seems to exhibit an over-fitting tendency for smaller training 
sets. 
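
The "about two-thirds" figure for bagging follows from sampling n examples with replacement: a given example is missed by all n draws with probability (1 - 1/n)^n, which tends to 1/e, so roughly 63% of the distinct examples appear in each bag. A short sketch (the function name is hypothetical) makes the comparison with the 90% used per fold in ten-fold cross-validation explicit: 

```python
import math


def expected_unique_fraction(n: int) -> float:
    """Expected fraction of distinct training examples in a bootstrap sample of size n."""
    return 1.0 - (1.0 - 1.0 / n) ** n


if __name__ == "__main__":
    for n in (57, 286, 3772):  # smallest, a mid-sized and the largest dataset in Table 3
        print(n, round(expected_unique_fraction(n), 3))
    print("limit:", round(1.0 - math.exp(-1.0), 3), "ten-fold CV:", 0.9)
```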

We have repeated the same experimental setup (original algorithm, bagged 
version, wrapped version, wrapped ensemble) for two other boosting algorithms 
as well: AdaBoostM1 over C4.5, as well as a variant of ADTree induction, where 
instead of choosing the globally best test at each boosting iteration all locally 
(at each prediction node) best tests are added, thus tripling the size of the tree 
at each iteration. Consequently, we have limited the total number of boosting 
iterations to 8 for this variant, and set it to 10 for AdaBoostM1, which seems to 
be a reasonable value reported for AdaBoostM1 over C4.5 induction [15]. 
Table 6. Significant wins, draws, and significant losses for various pair-wise 
comparisons at various noise-levels. An entry “i-j-k” means that the first algorithm wins 
significantly i times, draws j times, and significantly loses k times against the second 
algorithm. 

Noise   ADTree vs. Wrapping     Bagging vs. Wensemble 
 0%     4-9-3                   3-8-5 
10%     0-10-6                  2-12-2 
20%     0-4-12                  2-7-7 
30%     1-6-9                   0-10-6 

Noise   AdaBoost vs. Wrapping   Bagging vs. Wensemble 
 0%     3-9-4                   3-11-2 
10%     2-6-8                   0-5-11 
20%     1-8-7                   0-14-2 
30%     1-9-6                   0-13-3 

Noise   tADTree vs. Wrapping    Bagging vs. Wensemble 
 0%     3-9-4                   5-5-6 
10%     1-7-8                   4-7-5 
20%     0-8-8                   1-6-9 
30%     0-6-10                  0-8-8 



4 Discussion 

In this section we discuss two further opportunities offered by wrapping. First, 
wrapping allows for a more interactive approach to ensemble induction. As error 
estimates are computed sequentially for increasing ensemble sizes, these esti- 
mates can be displayed or graphed online, providing immediate feedback. This 
allows a user to immediately withdraw from investigating larger ensembles once 
the error estimates are either good enough or seem to have reached a plateau. 
Such interaction is valuable in exploratory data analysis. 

Second, the additive nature of both wrapping and bagging also allows for 
further compression of ensembles, provided the underlying base-level learner 
itself is also additive. This is not the case for AdaBoost in general, but it is 
certainly the case for ADTrees. Ensembles of ADTrees can be merged into a 
single ADTree, which usually reduces the total size by 20 to 30% (we compare 
the total number of prediction nodes in a wrapped ensemble of ADTrees to the 
total number of prediction nodes in the equivalent single merged ADTree). We 
are currently looking into more sophisticated ways of merging, thus hopefully 
compressing ensembles even further. 

Regarding the apparent success of wrapped ensembles over bagging in high 
noise cases, an explanation of this fact would be most welcome, as both methods 
seem to be quite similar. Wrapping seems to enjoy a better variance reduction 
than bagging under these circumstances. At least, some experimental bias plus 
variance decompositions that we have computed in the same way as described in 
[11] seem to indicate that. On the other hand, we are reluctant to put too much 
trust in these results, as their overall error sums seem to be of high variance, 
varying considerably with the specific split chosen for estimation. 



5 Conclusions 

Our empirical investigation of wrapping seems to indicate that wrapping is a 
viable (and efficient) alternative to bagging boosters, especially when we 
suspect considerable levels of class-noise in our data. Simple wrapping allows us to 
choose an appropriate size, and the wrapped ensemble variant looks even more 
promising: at zero noise wrapped ensembles are equivalent to bagged ensembles, at higher 
noise levels they significantly outperform bagged ensembles, and their induction 
times are about equal. Consequently, wrapped ensembles provide an effective 
and efficient safeguard for boosters against noise. 

In future work we hope to compare wrapping with some of the more 
sophisticated regularization approaches we have mentioned in the introduction; 
a comparison with BrownBoost and MultiBoost should be especially 
interesting. Furthermore, we want to concentrate more on larger KDD-class datasets. 
Such experiments might be able to further strengthen our hypothesis that cross- 
validation is prone to overfitting small datasets. Additionally, we want to inves- 
tigate the potential of merging, especially as a means of reducing total ensemble 
sizes, thus hopefully improving the comprehensibility of these merged ensem- 
bles. Furthermore, we are researching the applicability of wrapping to general 
complexity class estimation, a problem that is obviously not limited to boost- 
ing algorithms alone. Perhaps the most valuable achievement would be to find 
a way of replacing the currently user-specified maximal ensemble-size by some 
principled estimation. Unfortunately, our attempts in that direction have not 
been successful so far. We believe that something better than just presetting the 
maximal ensemble size to some ridiculously high value must exist. 

A WrapperID class as well as an appropriate interface for iterative 
classifiers, and a few exemplar iterative classifiers will all be included in the next 
version of the WEKA machine learning workbench [20], which is available under 
the GNU Public License. 
Table 7. Significant wins, draws, and significant losses for all four noise-levels and the 
following four algorithms: ADTree, bagged ADTree, wrapped ADTree, and a wrapped 
ensemble over ADTree. An entry “i-j-k” means that the row algorithm wins significantly 
i times, draws j times, and significantly loses k times against the column algorithm. 

ADTree 

no noise    Bagging   Wrapping   Wensemble 
ADTree      1-5-10    4-9-3      0-9-7 
Bagging               8-7-1      3-8-5 
Wrapping                         0-6-10 

10% noise   Bagging   Wrapping   Wensemble 
ADTree      0-3-13    0-10-6     0-3-13 
Bagging               8-6-2      2-12-2 
Wrapping                         0-5-11 

20% noise   Bagging   Wrapping   Wensemble 
ADTree      0-2-14    0-4-12     0-1-15 
Bagging               3-9-4      2-7-7 
Wrapping                         0-7-9 

30% noise   Bagging   Wrapping   Wensemble 
ADTree      0-6-10    1-6-9      0-4-12 
Bagging               0-11-5     0-10-6 
Wrapping                         0-10-6 

AdaBoost 

no noise    Bagging   Wrapping   Wensemble 
AdaBoost    2-2-12    3-9-4      1-4-11 
Bagging               8-6-2      3-11-2 
Wrapping                         1-8-7 

10% noise   Bagging   Wrapping   Wensemble 
AdaBoost    0-2-14    2-6-8      0-5-11 
Bagging               6-9-1      1-14-1 
Wrapping                         0-12-4 

20% noise   Bagging   Wrapping   Wensemble 
AdaBoost    0-6-10    1-8-7      0-6-10 
Bagging               5-9-2      0-14-2 
Wrapping                         0-13-3 

30% noise   Bagging   Wrapping   Wensemble 
AdaBoost    0-15-1    1-9-6      0-8-8 
Bagging               2-10-4     0-13-3 
Wrapping                         0-15-1 

tADTree 

no noise    Bagging   Wrapping   Wensemble 
tADTree     2-4-10    3-9-4      0-7-9 
Bagging               7-8-1      5-5-6 
Wrapping                         0-8-8 

10% noise   Bagging   Wrapping   Wensemble 
tADTree     0-3-13    1-7-8      0-5-11 
Bagging               6-7-3      4-7-5 
Wrapping                         0-8-8 

20% noise   Bagging   Wrapping   Wensemble 
tADTree     0-6-10    0-8-8      0-4-12 
Bagging               1-11-4     1-6-9 
Wrapping                         0-10-6 

30% noise   Bagging   Wrapping   Wensemble 
tADTree     0-12-4    0-6-10     0-5-11 
Bagging               1-7-8      0-8-8 
Wrapping                         0-14-2 
Abstract. We develop notation for describing a temporal structure over 
the real numbers flow of time. This forms a basis for various reasoning 
tasks including synthesizing a model from a given temporal or first-order 
specification. We announce an efficient procedure for finding a 
manageable description of such a model. There are applications in reasoning 
about multi-agent systems, understanding natural language, analogue 
devices, robotics and artificial reasoning. 



1 Introduction 

Linear temporal logic with a real-numbers flow of time is one of the most im- 
portant and applicable of the many and varied temporal logics. A continuous 
model of time respects everyday human intuitions, allows dense activity by any 
number of parallel components (agents or threads as well as hardware) or by the 
environment, and can support arbitrary overlapping intervals of states and so, 
as argued in [16] or [8], may be suited for many applications, ranging from philo- 
sophical, natural language and AI modelling of human reasoning to computing 
and engineering applications of concurrency, refinement, open systems, analogue 
devices and metric information. In contrast to the situation with discrete time 
steps [25], or discrete branching of discrete steps, the continuous model and its 
temporal logics have not been well understood. 

Any dense model of time may be appropriate for many of these applica- 
tions but the real-numbers are probably the most specifically correct in terms 
of intuitions and classical physics. There are interesting differences between the 
temporal logics of the reals and of other dense flows ([8], [11] or [12]) but here 
our primary concern is the logic over the real numbers and the technical results 
will be about that logic. 

For some applications it is sufficient to impose a “finite variability” require- 
ment on the truth values of atoms and hence reduce (albeit messily) the problems 
to standard discrete time tasks. See [28] and [16] for examples. The idea is that 
each atom is only allowed to change its truth value a finite number of times 
during each bounded interval of time. This assumption is acceptable when we 
are considering a closed system of discretely ticking components taking on bi- 
nary valued states. However, the finite variability assumption is not appropriate 
when we want to reason about the unlimited environment of a typical robot, the 
unbounded openness of an agent exposed to the Internet, an information system 
which includes one or many human users, a system with some states defined in 
terms of ranges of continuous valued measurements, a careful argument about 
new laws of physics in unfamiliar domains or the full richness of behaviour ex- 
pressible in human language. So we make no such finite variability assumption: 
valuations are unrestricted and the logic general. In fact, in our logic it is possi- 
ble to easily specify which components do satisfy finite variability while leaving 
other components (including the environment) unrestricted. 

The most natural and useful such temporal logic is propositional temporal 
logic over real-numbers time using the Until and Since connectives introduced in 
[15]. We will call this logic RTL in this paper. We know from [15] that this logic 
is as expressive as the first-order monadic logic of the real numbers and so at 
least as expressive as any other usual temporal logic that could be defined over 
real-numbers time (see [35] for a brief account of a less expressive logic). RTL is 
decidable [8] and complete axiom systems are given in [11] and [29]. 

The decision procedure in [8] uses Rabin’s non-elementarily complex decision 
procedure for the second-order monadic logic of two successors. In fact, deciding 
validity in the first-order monadic logic of the reals is a non-elementary problem 
[37]. The surprising recent result in [32] is that deciding (validity or satisfiability 
in) the equally expressive RTL is PSPACE-complete. 

In this paper we move the development on another step and consider how to 
usefully describe possible models of a satisfiable formula in the language. This 
is not straightforward as we are dealing with the vicissitudes of propositional 
atoms over a richly complicated and uncountable flow of time. One of the main 
contributions of this paper is a straightforward recursive notation which allows 
the description of some real-flowed structures in terms of simple combinations 
of simpler structures. We also announce that any temporal formula (and hence 
also any first-order monadic formula) satisfiable over the reals is satisfiable in 
one of these constructible real-flowed structures. 

The main result is a synthesis procedure or way of building a model of a 
satisfiable RTL formula. We sketch an EXPTIME procedure for finding the 
description of a model from any given satisfiable RTL formula. EXPTIME is 
shown to be a best possible bound. 

The proofs of these results use the new mosaic techniques for temporal logic 
developed in [30], [17] and [32]. These mosaics are small pieces of a real-flowed 
structure. We try to find a finite set of mosaics which is sufficient to be used 
to build a real-numbers model of a given formula. Then we build the model. 
Unfortunately, the proofs of these results are too long and complicated to give 
in full in this paper and will be presented in a longer version [31]. 

In the next section we define RTL. We then describe some of the many 
important application areas for the logic and carefully contrast the use of RTL 
in these areas with some of the established techniques (such as interval logics). 
The new notation for describing an RTL model is given in section 6 where we 
also give a brief sketch of the important new synthesis result. Finally we describe 
some potential applications of the notation and the synthesis construction. 

2 The Logic 

Fix a countable set $\mathcal{L}$ of atoms. Here, frames $(T, <)$, or flows of time, 
will be irreflexive linear orders. Structures $\mathcal{T} = (T, <, h)$ will have a 
frame $(T, <)$ and a valuation $h$ for the atoms, i.e. for each atom 
$p \in \mathcal{L}$, $h(p) \subseteq T$. Of particular importance will be real 
structures $\mathcal{T} = (\mathbb{R}, <, h)$, which have the real numbers flow 
(with their usual irreflexive linear ordering). 

The language $L(U, S)$ is generated by the 2-place connectives $U$ and $S$ along 
with classical $\neg$ and $\wedge$. That is, we define the set of formulas 
recursively to contain the atoms and, for formulas $\alpha$ and $\beta$, we include 
$\neg\alpha$, $\alpha \wedge \beta$, $\beta U \alpha$ and $\beta S \alpha$. As we 
will see, $\beta U \alpha$ means that $\beta$ holds at all times until a time when 
$\alpha$ holds. Hence the “$\beta$ until $\alpha$” reading. Similarly for “since”. 

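Formulas are evaluated at points in structures $\mathcal{T} = (T, <, h)$. We write 
$\mathcal{T}, x \models \alpha$ when $\alpha$ is true at the point $x \in T$. This is 
defined recursively as follows. Suppose that we have defined the truth of formulas 
$\alpha$ and $\beta$ at all points of $\mathcal{T}$. Then for all points $x$: 

$\mathcal{T}, x \models p$ iff $x \in h(p)$, for $p$ atomic; 

$\mathcal{T}, x \models \neg\alpha$ iff $\mathcal{T}, x \not\models \alpha$; 

$\mathcal{T}, x \models \alpha \wedge \beta$ iff both $\mathcal{T}, x \models \alpha$ 
and $\mathcal{T}, x \models \beta$; 

$\mathcal{T}, x \models \beta U \alpha$ iff there is $y > x$ in $T$ such that 
$\mathcal{T}, y \models \alpha$ and for all $z \in T$ such that $x < z < y$ we have 
$\mathcal{T}, z \models \beta$; and 

$\mathcal{T}, x \models \beta S \alpha$ iff there is $y < x$ in $T$ such that 
$\mathcal{T}, y \models \alpha$ and for all $z \in T$ such that $y < z < x$ we have 
$\mathcal{T}, z \models \beta$. 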
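
The truth clauses above apply to any irreflexive linear order, so they can be 
tried out on a small finite flow of time. The following is only a sketch under 
that assumption: it does not decide anything about the reals, and the class 
names (Atom, Not, And, Until, Since) and the function holds are illustrative 
names introduced here, not notation from the paper. 

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Atom:
    name: str


@dataclass(frozen=True)
class Not:
    sub: "Formula"


@dataclass(frozen=True)
class And:
    left: "Formula"
    right: "Formula"


@dataclass(frozen=True)
class Until:
    # Until(beta, alpha) stands for "beta U alpha"
    beta: "Formula"
    alpha: "Formula"


@dataclass(frozen=True)
class Since:
    # Since(beta, alpha) stands for "beta S alpha"
    beta: "Formula"
    alpha: "Formula"


Formula = Union[Atom, Not, And, Until, Since]


def holds(points, h, x, f):
    """Truth of formula f at point x.

    points: the finite flow of time, a list of distinct comparable values.
    h: valuation, a dict from atom names to the set of points where they hold.
    """
    if isinstance(f, Atom):
        return x in h.get(f.name, set())
    if isinstance(f, Not):
        return not holds(points, h, x, f.sub)
    if isinstance(f, And):
        return holds(points, h, x, f.left) and holds(points, h, x, f.right)
    if isinstance(f, Until):
        # some strictly later y satisfies alpha, beta holds strictly in between
        return any(
            holds(points, h, y, f.alpha)
            and all(holds(points, h, z, f.beta) for z in points if x < z < y)
            for y in points
            if y > x
        )
    if isinstance(f, Since):
        # mirror image: some strictly earlier y satisfies alpha
        return any(
            holds(points, h, y, f.alpha)
            and all(holds(points, h, z, f.beta) for z in points if y < z < x)
            for y in points
            if y < x
        )
    raise TypeError(f"unknown formula: {f!r}")
```

Because both connectives are strict, the point of evaluation itself is never 
inspected by Until or Since; on a finite order this makes a formula such as 
Until(beta, alpha) hold at a point whose immediate successor satisfies alpha, 
whatever beta is, which cannot happen over the dense real flow. 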

Definitions, results or proofs will often have a mirror image in which $U$ and 
$S$ are exchanged and $<$ and $>$ swapped. There are also plenty of abbreviations 
including “truth” $\top = \neg(p \wedge \neg p)$; “falsity” $\bot = \neg\top$; 
“will” $F\alpha = \top U \alpha$; “will always” $G\alpha = \neg F \neg\alpha$; 
“for a while” $\Gamma^{+}\alpha = \alpha U \top$; and “arbitrarily soon” 
$K^{+}\alpha = \neg\Gamma^{+}(\neg\alpha)$. 

In [15] it is shown that RTL is as expressive as the first-order monadic logic 
of the real order. See [10] for details. This means that in terms of expressiveness 
it is the right temporal logic for such structures. It is important to note that 
there is a very similar looking but less expressive [32] temporal logic built from 
so-called non-strict until and non-strict since which we do not consider in this 
paper. 

A formula $\phi$ of $L(U, S)$ is $\mathbb{R}$-satisfiable if it has a real model: 
i.e. there is a real structure $\mathcal{S} = (\mathbb{R}, <, h)$ and 
$x \in \mathbb{R}$ such that $\mathcal{S}, x \models \phi$. A formula is 
$\mathbb{R}$-valid iff it is true at all points of all real structures. Of course, 
a formula is $\mathbb{R}$-valid iff its negation is not $\mathbb{R}$-satisfiable. 
The set of RTL formulas which are $\mathbb{R}$-valid has been axiomatized with the 
help of a special irreflexivity rule in [11] and also in [29] where only 
traditional inference rules are used. Let RTL-SAT be the problem of deciding 
whether a given formula of $L(U, S)$ is $\mathbb{R}$-satisfiable or not. 
The main result in [32] is: 



Theorem 1. RTL-SAT is PSPACE-complete. 

3 RTL Instead of Intervals 



It is common to see interval based temporal logics and related formalisms used 
in AI applications which may have dense time semantic underpinning. These 
include artificial planning tasks [1], semantics for tense and aspect in natural 
language [19] and reasoning about the evolution of spatial relationships [36]. The 
density of time allows the arbitrary overlap of different actions or states. 

There has been much discussion about the relative merits of point-based 
versus interval-based temporal logics in such applications [38]. It is probably 
generally accepted that in many applications an interval-based representation 
can be interpreted in terms of states holding at the points which make up the 
interval. (See [13], chapter 8, for a rare exception). 

For constraint problems in planning it is often sufficient to just reason in 
terms of networks of intervals and the order relations between each pair. In 
fact, there are some very efficient procedures [23] for answering specific planning 
questions [22]. However, for more sophisticated or more general reasoning about 
intervals an interval temporal logic is necessary [14]. There is a modal diamond 
in Halpern and Shoham’s logic HS for each of Allen’s thirteen relations 
between intervals [2]. For example, $\langle D \rangle p$ is true of an interval 
$x$ iff there is another interval $y$ making $p$ true such that $x$ is a 
subinterval of $y$, i.e. $x$ is related to $y$ by Allen’s “during” relation. 

Unfortunately HS is highly undecidable: it is not even recursively axiomatiz- 
able. Because of this, researchers have often imposed restrictions on the temporal 
structures which are reasoned about and thus been able to define more manage- 
able interval logics. For example, there is often a “homogeneity” assumption (see, 
for example, [18]) which does not allow properly overlapping intervals to satisfy 
the same atomic propositions unless they are both properly included in a larger 
interval which also satisfies that proposition. It thus becomes possible to recast 
such structures in a point-based way instead: a proposition holds at a point iff 
there is some interval containing that point which satisfies the proposition. 

Another approach to making interval logics more manageable (in this case 
axiomatizable) is Venema’s “flat” interval logic [39] in which a proposition holds 
at an interval iff it holds at the start point. Again this can easily be recast in 
point-based form. 

Yet another approach to interval logics (usually in the discrete case) seen 
in [21] and [7] is based directly on evaluating truth of atomic propositions at 
points. 

If it is the case, as we have seen, that the kind of reasoning about structures 
done using intervals can often be done with atomic propositions being evaluated 
at points, then the expression and specification of properties can equally be 
done using the first-order monadic logic. For example, suppose that under our 
assumption we represent proposition $p$ holding of the interval $(u, v)$ by the 
monadic condition $\forall z((u < z < v) \rightarrow P(z))$ using a 1-ary 
predicate. The HS formula $\langle D \rangle p$ holding at $(u, v)$ then becomes 
$\exists x \exists y((x < u < v < y) \wedge \forall z((x < z < y) \rightarrow P(z)))$. 
Thanks to Kamp’s theorem [15], we see then that we can equally well use 
point-based temporal logics such as RTL to express these properties. To say that 
now is the beginning of an interval satisfying $\langle D \rangle p$ becomes simply 
$pSp \wedge p \wedge pUp$. 

4 RTL and Timing 

In many applications of complex systems, timing or metric considerations are 
important. Reasoning about the behaviour of safety critical systems [24] and 
multimedia specifications [6] are just two examples. A good account of this so- 
called real-time logic area appears in [3]. Most of the timing work is built on 
discrete time temporal logics and indeed any move to a dense order of times 
usually results in highly undecidable logics [3]. 

Despite these sorts of results it is not hard to add a limited form of metric 
expressiveness to RTL: in fact it can be added within RTL with no loss of 
decidability. The basic idea is that we add a proposition whose truth represents 
the ticking of a clock and then we express timing requirements between events 
in terms of the number of ticks in between. The constraints can be made more 
strict by introducing grades of granularity of ticking. 

We introduce a new proposition tick, say. Let us suppose that we start 
ticking at a certain time: it is not hard to change this construction if ticking 
forever into the past is required instead. By including the following conjuncts 
we impose a simple metric on our real structure: 

$\mathit{tick} \wedge G((\neg\mathit{tick})\,U\,\mathit{tick}) \wedge G((\neg\mathit{tick})\,S\,\mathit{tick})$. 

The last conjunct is just a so-called non-Zeno condition which ensures “finite 
variability” of the ticking. 

With a ticking clock we can now introduce some simple metric information 
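into RTL formulas. For example, to approximate the condition that $p$ must be 
followed within 1 tick by a $q$ we can use $G(p \rightarrow \Diamond_{<2}\, q)$ 
with $\Diamond_{<2}\, q$ an abbreviation for 
$\neg((\neg q \wedge \neg\mathit{tick})\,U\,(\mathit{tick} \wedge \neg q \wedge ((\neg q \wedge \neg\mathit{tick})\,U\,\mathit{tick})))$. 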

If a finer granularity of metric is required then obviously we can introduce 
a few finer layers by requiring a certain number of sub-ticks between ticks 
and sub-sub-ticks between them and so on. The kind of logics needed for such 
granular reasoning can be found in [20]. They are usually computationally 
complex. 

Of course, using abbreviations in terms of ticks as above does not allow us to 
quantify over metric values. However, as pointed out in [3] it is just this facility 
which makes metric temporal logics undecidable. It is also arguably the case that 
most practical applications of the metric logics can be specified without recourse 
to quantification. 

5 RTL and Open or Compositional Systems 

The suggestion that dense time temporal logics might be useful for compositional 
reasoning (as in distributed or multi-agent systems) was first made in [5]. If 
discrete time is used instead, then all sorts of difficulties arise when the steps (or 
ticks) of one component do not match up with the steps of another component 
or what is taken to be the steps of the combined system. The notation can 
become quite messy when proof rules have to allow for this mismatch as in [9] 
for example. The alternative with discrete time is the draconian supposition 
that there is a universal clock for all modules as in [24]. In contrast, if we use 
dense time then we do not need any such assumption. The proof rule is perfectly 
simple: from $\alpha$ holding of one component and $\beta$ of another we can simply 
deduce that $\alpha \wedge \beta$ holds of the combined system. 

Open systems, those in the presence of an unpredictable environment, present 
even more difficulties for discrete time formalisms as there is not necessarily any 
appropriate notion of its steps. As discussed in the introduction there are many 
circumstances when any sort of finite variability assumption becomes unten- 
able. Once again, when we move to a dense model of time, a formal account is 
straightforward. For example, to say that no matter how soon we measure after 
the water boils, the steam pressure will have become detectable, simply use 

$\mathit{boil} \rightarrow \neg((\neg\mathit{detectable})\,U\,\mathit{measure})$. 

In the near future, verification of properties of neural networks [33] may 
become an important application area for formal metric compositional reasoning 
in dense time. 

6 Building Structures 

We now introduce a notation which allows the description of a temporal struc- 
ture in terms of simple basic structures via a small number of ways of putting 
structures together to form larger ones. In all cases the underlying flow of time 
of each structure will be an interval of the real numbers. Crucially we also allow 
the possibility of a singleton interval in which the flow of time is one point. 

The general idea is simple: using singleton structures we build up to more 
complex structures by the recursive application of four operations. They are: 

- the sum of two structures, consisting of one followed by the other; 

- $\omega$ repeats of some structure laid end to end towards the future or, 
alternatively, towards the past; 

- and making a densely thorough shuffle (see below) of copies from a finite set 
of structures. 

These sorts of operations are familiar in the study of linear orders (see, for 
example, [8]) and the details of the notation and operations could have been 
done in a variety of ways. 

There are slight complications to do with the need to join structures up end 
to end in the right way without leaving a gap. We choose to solve this problem 
by classifying our structures according to whether the end points are included 
or not. Thus we have open-open, open-closed, closed-open and closed-closed 
interval structures depending on whether the left end or right end respectively 
is included. Singleton structures are closed-closed. We want to end up with an 
open-open structure whose flow of time will thus be isomorphic to the whole real 
numbers. 

Yet another unimportant complication is the fact that in giving a description 
of a particular and well-defined structure we must give a particular and well- 
defined concrete form to the shuffle. In this paper we will ignore this complica- 
tion because, as we show in the longer paper, it turns out that any sufficiently 
thorough mixture of the sub-structures would do equally well. 

The only really important detail is that the shuffle must involve at least one 
singleton structure. Copies of this are distributed as a sort of background filler 
throughout the shuffle thus guaranteeing the two crucial properties of intervals 
of the reals: Dedekind completeness and separability. Recall that we say a linear 
order $(T, <)$ is Dedekind complete if any subset $S \subseteq T$ with an upper 
bound (i.e. there is $t \in T$ such that for all $s \in S$, $s \le t$) has a least 
upper bound. Recall also that the reals are separable, i.e. have a countable 
suborder (e.g. the rationals) which is distributed densely throughout the reals. 

So now we are interested in structures $(T, <, h)$ with $(T, <)$ being order 
isomorphic to some interval of the real numbers. 

A singleton structure is just a structure $\mathcal{X} = (\{x\}, \emptyset, h)$. 

Assume we have interval structures $\mathcal{T}_1 = (T_1, <_1, h_1)$ and 
$\mathcal{T}_2 = (T_2, <_2, h_2)$ with $\mathcal{T}_1$ closed on the right and 
$\mathcal{T}_2$ open on the left. The sum $\mathcal{T}_1 + \mathcal{T}_2$ of 
$\mathcal{T}_1$ and $\mathcal{T}_2$ is then defined to be $(T, <, h)$ where: 

$T = \{(1, t) \mid t \in T_1\} \cup \{(2, t) \mid t \in T_2\}$; 

$(i, t) < (j, s)$ iff $i < j$ or $i = j$ and $t <_i s$; 

$h(p) = \{(1, t) \mid t \in h_1(p)\} \cup \{(2, t) \mid t \in h_2(p)\}$. 

This is clearly an interval structure itself, with a classification dependent on 
the left end of $\mathcal{T}_1$ and the right end of $\mathcal{T}_2$. Similarly we 
can find the sum of an interval open on the right with one which is closed on the 
left. 
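
The tagged-union form of the sum can be illustrated on finite samples of two 
structures. This is only a sketch: real interval structures are uncountable, 
the endpoint conditions are not checked here, and the function name 
sum_structures is a hypothetical one chosen for this illustration. Tagging the 
points with 1 and 2 and comparing the pairs lexicographically gives exactly the 
order described in the definition. 

```python
def sum_structures(points1, h1, points2, h2):
    """Sum of two finite structures, following the tagged-pair definition.

    points1, points2: lists of comparable points (samples of each flow).
    h1, h2: valuations, dicts from atom names to sets of points.
    Returns the sorted list of tagged points and the combined valuation.
    """
    # T = {(1, t) | t in T1} united with {(2, t) | t in T2}
    points = [(1, t) for t in points1] + [(2, t) for t in points2]
    # h(p) collects the tagged copies of the points where p held
    atoms = set(h1) | set(h2)
    h = {
        p: {(1, t) for t in h1.get(p, set())} | {(2, t) for t in h2.get(p, set())}
        for p in atoms
    }
    # Python compares tuples lexicographically: tag first, then the point
    return sorted(points), h
```

Every point of the first structure precedes every point of the second, even 
when the underlying values coincide, because the tag is compared first. 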

Given an interval structure $\mathcal{T}_1 = (T_1, <_1, h_1)$ which is closed on 
the left and open on the right we define the structure 
$\mathrm{wax}(\mathcal{T}_1)$ to be $\mathcal{T} = (T, <, h)$ given by: 

$T = \{(-i, t) \mid i \in \omega,\ t \in T_1\}$; 

$(i, t) < (j, s)$ iff $i < j$ or $i = j$ and $t <_1 s$; 

$(i, t) \in h(p)$ iff $t \in h_1(p)$. 

This is an open-open interval structure. 



[Figure: $\mathcal{T} = \mathrm{wax}(\mathcal{T}_1)$, built from copies of 
$\mathcal{T}_1$ laid end to end.] 

The mirror image is the open-open interval structure $\mathrm{wane}(\mathcal{T}_1)$. 

Now suppose that we have singleton structures $\mathcal{X}_0, ..., \mathcal{X}_r$ 
($r \ge 0$) and non-singleton closed-closed interval structures 
$\mathcal{T}_1, ..., \mathcal{T}_s$ ($s \ge 0$). In the full version of the paper 
[31] we define the shuffle 
$\mathrm{shuff}(\mathcal{X}_0, ..., \mathcal{X}_r, \mathcal{T}_1, ..., \mathcal{T}_s)$ 
to be a particular open-open interval structure $\mathcal{T} = (T, <, h)$ which 
has a dense thorough mixture of copies of all the intervals (i.e. all 
$\mathcal{X}_i$ and all $\mathcal{T}_j$) in between. However, as mentioned above, 
any thoroughly dense mixture of the $\mathcal{X}_i$ and $\mathcal{T}_j$ will do: 
between any two copies of any $\mathcal{X}_i$ or any $\mathcal{T}_j$ must lie 
copies of each of the $\mathcal{X}_i$ and each of the $\mathcal{T}_j$. The 
singleton structure $\mathcal{X}_0$, which must be present, is used as a sort of 
background filler to ensure that the combined order is Dedekind complete while 
still being separable. The case of $s = 0$ is allowed in which case the shuffle is 
built from just singletons. 

We define the set of constructible interval structures to contain all the single- 
tons and be closed under constructing sums, waxes, wanes and shuffles. Define 
the set of Constructible real structures to be the open-open constructible interval 
structures. 
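
The endpoint bookkeeping behind these definitions can be sketched as a small 
checker over construction expressions. The class names and the function ends 
are assumptions made for this illustration, and the requirement placed on wane 
(an argument open on the left and closed on the right) is our reading of 
“mirror image” rather than something stated explicitly. The checker only 
tracks whether each end is open or closed; it says nothing about which 
formulas a structure satisfies. 

```python
from dataclasses import dataclass
from typing import Tuple

OPEN, CLOSED = "open", "closed"


@dataclass(frozen=True)
class Singleton:
    label: Tuple[str, ...]  # literals true at the single point, e.g. ("p", "~q")


@dataclass(frozen=True)
class Sum:
    left: object
    right: object


@dataclass(frozen=True)
class Wax:
    body: object


@dataclass(frozen=True)
class Wane:
    body: object


@dataclass(frozen=True)
class Shuffle:
    singletons: Tuple[Singleton, ...]
    intervals: Tuple[object, ...] = ()


def ends(e):
    """Return the (left, right) endpoint classification of expression e.

    Raises ValueError if a construction step is applied to an argument
    with the wrong kind of end points.
    """
    if isinstance(e, Singleton):
        return (CLOSED, CLOSED)
    if isinstance(e, Sum):
        l_left, l_right = ends(e.left)
        r_left, r_right = ends(e.right)
        # exactly one of the two touching ends must be closed: no gap, no overlap
        if l_right == r_left:
            raise ValueError("sum: touching ends must be one open, one closed")
        return (l_left, r_right)
    if isinstance(e, Wax):
        if ends(e.body) != (CLOSED, OPEN):
            raise ValueError("wax: argument must be closed-open")
        return (OPEN, OPEN)
    if isinstance(e, Wane):
        # assumed mirror image of the wax requirement
        if ends(e.body) != (OPEN, CLOSED):
            raise ValueError("wane: argument must be open-closed")
        return (OPEN, OPEN)
    if isinstance(e, Shuffle):
        if not e.singletons:
            raise ValueError("shuffle: at least one singleton is required")
        for t in e.intervals:
            if isinstance(t, Singleton) or ends(t) != (CLOSED, CLOSED):
                raise ValueError("shuffle: intervals must be non-singleton closed-closed")
        return (OPEN, OPEN)
    raise TypeError(f"unknown expression: {e!r}")


# A five-part sum alternating shuffles and singletons, read left to right.
pq = Singleton(("p", "q"))
p_not_q = Singleton(("p", "~q"))
example = Sum(Sum(Sum(Sum(Shuffle((pq,)), pq), Shuffle((p_not_q, pq))), pq), Shuffle((pq,)))
print(ends(example))  # prints ('open', 'open')
```

An expression describes a constructible real structure, in the sense defined 
above, exactly when the checker accepts it and reports both ends as open. 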

It is straightforward to make the notation completely formal in the case of 
a finite set of atoms, and this is the case when we are considering a particular 
temporal formula. For example, let $[p, \neg q]$ represent a singleton structure 
with the obvious valuation. We might then suggest 

$\mathrm{shuff}([p, q]) + [p, q] + \mathrm{shuff}([p, \neg q], [p, q]) + [p, q] + \mathrm{shuff}([p, q])$ 

as a model of $Gp \wedge (\neg((\neg q)Uq) \wedge \neg(qUq))Uq$. 




[Figure: $q$, then a dense mixture of $q$ and $\neg q$, then $q$.] 

The main result announced in this paper is that an RTL formula has a real- 
flowed model iff it has a constructible real model. Thus we can describe a model 
in our new notation. 

Theorem 2. A formula $\phi$ from $L(U, S)$ is $\mathbb{R}$-satisfiable iff there 
is a constructible real model of $\phi$. There is some $c$ such that, in that case, a model 
can be described by an expression of shuffles, waxes, wanes and sums of length 
< (this bound is best possible). 

Furthermore, there is an EXPTIME procedure for finding such an expression. 

The proof of the theorem is long, complex and technical in parts. It re- 
lies on the new temporal mosaic techniques developed in [30], [17] and [32] as 
well as some deep reasoning about properties of the real numbers. Much of the 
groundwork is laid in the proof of the complexity of the decision problem [32]. 
However, there is also important new work on the properties of shuffles, back 
and forth morphisms and a procedure for enumerating representations of models 
constructed from sets of mosaics. The full details will be presented in [31]. 

We only have space here to give a brief sketch of the proof. Suppose that 
we want a model of $\phi$. Only a small finite closure set of subformulas of $\phi$ and 
their negations will be of interest to us. Mosaics are triples of sets of formulas 
from the closure set representing the formulas which are true at two points in a 
model and the formulas which hold at all points in between. That is the intuition 
behind the concept of a mosaic: the actual definition is just in terms of several 
conditions, called coherency conditions, on the relationship between the formulas 
which may appear in the three sets. For example, if $G\alpha$ is in the set representing 
the earlier time point then $\alpha$ must be in the set representing the later point. 

We are able to show that the satisfiability of $\phi$ is equivalent to the existence 
of a finite set of such mosaics called a real mosaic system (RMS) which is closed 
under certain conditions (called saturation conditions). The saturation condi- 
tions impose a hierarchy of layers on the set of mosaics and require that mosaics 
in one layer can be decomposed in terms of shuffles, waxes, wanes and sums 
of mosaics in lower layers. The required model construction expression can be 
extracted recursively from these decompositions. 

The bound on the length of the expression can be extracted from a linear 
bound on the number of levels needed in the set and a bound on the length of 
each decomposition from one level in terms of those below. 

We can also show that the length bound is best possible by considering a formula 
describing a binary counter. Given n, use n atoms to describe a counter which 
increases at discrete intervals. A formula of quadratic length (in n) can be used to 
specify such a model but a description of the model in our construction notation 
needs to be of exponential length. 

Finally, we can give an EXPTIME procedure for finding and printing out a 
model of any satisfiable RTL formula. The set of all $\phi$-mosaics is of size 
exponential in the length of $\phi$. There is a fairly straightforward procedure 
(in the style of [27]) for going through the set repeatedly and removing mosaics 
which cannot be fully decomposed in terms of other simpler ones in the set. If 
$\phi$ is satisfiable we will eventually end up with an RMS and another 
straightforward EXPTIME procedure reads out the description of a model of $\phi$. 
By repeatedly 
decomposing mosaics as specified in the RMS we can produce the expression in 
a top-down manner. 
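
The shape of the removal loop just described can be sketched generically. The 
actual test for whether a mosaic can be fully decomposed is not given in this 
paper (it is deferred to [31]), so the predicate is_supported below is a 
hypothetical placeholder supplied by the caller, and eliminate is a name 
chosen for this sketch. 

```python
def eliminate(candidates, is_supported):
    """Repeatedly discard items that are not supported by the remaining set.

    candidates: iterable of hashable items (standing in for mosaics).
    is_supported(item, current): hypothetical placeholder for the
        decomposition check; it should only become harder to satisfy as
        items are removed from current.
    Returns the set that is left once a full pass removes nothing.
    """
    current = set(candidates)
    changed = True
    while changed:
        changed = False
        for item in list(current):
            if not is_supported(item, current):
                current.discard(item)
                changed = True
    return current
```

As a toy illustration, take items that each depend on a list of other items and 
call an item supported when all of its dependencies are still present; an item 
depending on something missing is removed, and anything depending on it is 
removed on a later pass. In the procedure described above an empty or otherwise 
unsuitable result would correspond to the formula being unsatisfiable, but that 
reading depends on the details in [31]. 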

Note that thanks to the expressive completeness result in [15], we know that 
any satisfiable sentence of the first-order monadic logic of the reals also has a 
constructible real model. To find a description of a model from the sentence 
must be a hard problem as deciding validity in this logic is non-elementarily 
complex [37]. One could use the separation technique of [10] to first find an 
equivalent temporal formula and then use the procedure above. The translation 
to the temporal formula must be the time-consuming part of the process. 

Other results from [31] allow us to conclude that if we find all possible start- 
ing points (ie relativized mosaics in the RMS) and follow all possible ways of 
decomposing the mosaics then we will eventually output a list of possible models 
of the formula which is in a certain sense exhaustive. Any real model of $\phi$ will 
be back-and-forth equivalent to one of the constructible models which is listed. 

It is worth noting that there is a similar sort of result to our theorem in [8] 
where it is shown that an RTL formula has a real-flowed model iff it has a model 
with the valuation of each atom being a Borel set, i.e. one obtained from open sets 
by iterated application of complementation and countable union. The important 
advantages of our new result are that we can give a finite representation in our 
notation and we give an efficient means for finding it. 

We have space for a few more brief descriptions of the notation in action. 
The formula $qUp$ has a model shuff([p, q]) (where, as above, [p, q] is the singleton 
model with p and q holding at the point). The formula K~^q A r~^{{^q)Uq) A 
A ^r+(^p)) has a model 

$\mathrm{shuff}([p, q]) + [p, q] + \mathrm{wax}([p, q] + \mathrm{shuff}([p, \neg q], [\neg p, \neg q]))$. 

The formula T+(pV {{^p)Up)) A K^{{^p)Up) A T+(p ^ K^p) A ~^{{^p)Up) has 
a model 

$\mathrm{shuff}([p], [\neg p] + \mathrm{shuff}([\neg p]) + [p])$. 



7 Conclusion 

We have seen that reasoning with continuous time has many and varied impor- 
tant applications across computing, AI and systems engineering. 

Deciding satisfiability (or validity), as investigated in [32], is one of the most 
important reasoning tasks but others include synthesis, model-checking and de- 
ciding realizability. These three all involve notions of a formal representation of 
a model and so we consider how our new notation helps with these tasks. 

Our main theorem above is a synthesis result: it shows how to make a model 
of a given formula (assuming the formula is satisfiable). In many application 
areas it is helpful to be able to check and develop formal specifications in terms 
of intuitions. One way to do this is to build a model (or several models) of 
the specification and see what other properties the model has. Identifying a 
preferred model of a statement is itself one of the main tasks of natural language 
processing. The RTL synthesis procedure can be used for this purpose. It is also 
useful for giving concrete counter-examples to incorrect consequences such as 
a desired property supposedly following from a detailed specification. Efficient 
synthesis techniques in limited sublanguages may also give rise to executable 
continuous temporal logics (cf. [4]). 

In fact, now that we know RTL is no harder to reason with than well-known 
discrete time logics (discrete time formulas also can have models which 
take exponential size expressions to describe [34]), we can follow the suggestions 
in [16] and use continuous time for many of the compositional reasoning tasks 
associated with verification of specifications. 

The construction notation itself might also be useful as a way of describing 
real-flowed structures which can then be “model-checked” against a temporal 
or monadic specification. Techniques for model-checking in RTL need to be de- 
veloped. 

Other important future work includes development of the idea of “realiz- 
ability” in continuous time. This is the question which arises when a temporal 
specification is given and we want to know whether a system in control of only 
some of the propositions in the specification can guarantee that the specification 
can be met whatever the behaviour of the environment which controls the rest of 
the propositions. Early work in a special case has been done in [26] but a fuller 
account will need development of continuous branching time logics. 

Abstract. It is by now widely accepted that a number of tasks in natural 
language understanding (NLU) require the storage of and reasoning with 
a vast amount of background (commonsense) knowledge. While several 
efforts have been made to build such ontologies, a consensus on a 
scientific methodology for ontological design is yet to emerge. In this 
paper we suggest an approach to building a commonsense ontology for 
language understanding using language itself as a design guide. The idea 
is rooted in Frege’s conception of compositional semantics and is related 
to the idea of type inferences in strongly-typed, polymorphic 
programming languages. The method proposed seems to (i) resolve the 
problem of multiple inheritance; (ii) suggest an explanation for polysemy 
and metaphor; and (iii) provide a step towards establishing a systematic 
approach to ontological design. 



1 Introduction 



Recent work in natural language understanding (NLU) seems to be slowly 
embracing what we like to call the ‘understanding as reasoning’ paradigm, as it 
is quite clear by now that understanding natural language is, for the most part, 
a commonsense reasoning process at the pragmatic level, for example in such 
tasks as reference resolution, plan recognition, lexical disambiguation, 
prepositional phrase attachments, temporal coherence, and the resolution of 
quantifier scope ambiguities. For instance, consider the resolution of ‘He’ in the 
following: 

John shot a policeman. He immediately 

a) fled away. (1) 

b) fell down. 

It is quite difficult to imagine how children effortlessly resolve such references, 
if not by recourse to the commonsense facts that, typically, when shot(x, y) 
holds between some x and some y, x is the more likely subject to flee and y is 
the more likely subject to fall down. Other examples of commonsense reasoning 
in language understanding involve the resolution of quantifier scope 
ambiguities. Consider the following: 



{every / few / two} graduate student(s) at MIT submitted a paper to ACL ’99      (2) 



M. Brooks, D. Corbett, and M. Stumptner (Eds.): AI 2001, LNAI 2256, pp. 426-437, 2001. 

We argue that the plausibility of wide scope ‘a’ (implying a single paper) 
increases as the number of students involved in the relation decreases. Lacking 
a syntactic or a semantic explanation, this inference must be a function of our 
commonsense knowledge of how the ‘submit’ relation between students and 
‘papers’ is typically manifested in the real world. Specifically, this inference is 
based on our commonsense belief that the submit relation between a student 
and a paper is typically [l..m]-to-l, where m is some small number. Moreover, 
different individuals seem to have a slightly different value for m, which is 
consistent with the findings of Kurtzman and MacDonald (1993) that different 
individuals seem to have different scope preferences in the same textual 
context.^ 

The ‘understanding as reasoning’ paradigm is certainly not entirely new in 
NLU research. Within the AI community, this paradigm was implicitly 
embraced by a number of authors (e.g., see Charniak, 1986; Hirst, 1986; Wilks, 
1975; Schank, 1982). Unfortunately, however, these approaches were largely 
based on ad hoc algorithms built on top of informal knowledge representations. 
Due to the lack of formality, these procedures were hopelessly unscalable, and 
scalability was for the most part attempted by pushing the problem from the 
procedures to the data, which consequently led to the so-called knowledge 
bottleneck. The lack of progress in solving the knowledge bottleneck problem 
generally led AI researchers either to abandon inferential and knowledge-based 
approaches in favor of more quantitative approaches (e.g., Charniak, 1993), or 
to focus almost exclusively on the development of large commonsense 
knowledge bases (e.g., Lenat and Guha, 1990). Within linguistics and formal 
semantics, on the other hand, little or no attention was paid to the issue of 
commonsense reasoning at the pragmatic level. Indeed, the prevailing wisdom 
(which might be partly due to lack of progress in AI-based NLU) was that a 
number of NLU tasks require the storage of and reasoning with a vast amount 
of background knowledge (van Deemter, 1996), an opinion that led some (e.g., 
Reinhart, 1997) to conclude that pragmatic approaches are ‘highly 
undecidable’. 

In our view both trends were partly misguided. In particular, we hold the 
view that while language understanding is for the most part a commonsense 
reasoning process at the pragmatic level, this reasoning process and the 
underlying knowledge structures that it utilizes must be formalized if we ever 
hope to build scalable systems. In this light we believe the work on integrating 
logical and commonsense reasoning in language understanding (e.g., Allen, 
1987; Pereira & Pollack, 1991; Zadrozny & Jensen, 1991; Hobbs, 1985; Hobbs 
et al., 1993; and more recently Asher & Lascarides, 1998; and Saba & 
Corriveau, 1997) is of paramount importance. Much of this work is directed 
towards formulating commonsense inferencing strategies to resolve a number of 
ambiguities at the pragmatic level. Although it has been shown (see Saba & 
Corriveau, 2001) that these inferences do not always require the storage of and 
reasoning with a vast amount of background knowledge, it is clear that a 
number of tasks do require such a knowledge base. Indeed, substantial effort has 
been made towards building ontologies of commonsense knowledge (e.g., Lenat 
& Guha, 1990; Mahesh & Nirenburg, 1995; Sowa, 1995), and a number of 
promising trends that advocate ontological design based on sound linguistic 
and logical foundations have started to emerge in recent years (e.g., Guarino & 
Welty, 2000; Pustejovsky, 2001). However, a systematic and objective approach 
to ontological design is still lacking. In particular, we believe that an ontology 
for commonsense knowledge must be discovered rather than invented, and thus 
that it is not sufficient to establish some principles for ontological design; rather, a 
strategy by which a commonsense ontology might be systematically and 
objectively designed must be developed. 

¹ An inferencing strategy that models individual preferences in the resolution of scope 
ambiguities at the pragmatic level has been suggested in (Saba and Corriveau, 2001). 



2 Language Use as Guide to Ontological Design 



Our basic strategy for designing an ontology of commonsense knowledge is 
rooted in Frege’s conception of Compositionality. According to Frege (see 
Dummett, 1981, pp. 4-7), the sense of any given sentence is derived from our 
previous knowledge of the senses of the words that compose it, together with 
our observation of the way in which they are combined in that sentence. The 
cornerstone of this paradigm, however, is an observation that has not been 
fully appreciated regarding the manner in which words are supposed to acquire 
a sense. In particular, the principle of Compositionality is rooted in the thesis 
that “our understanding of [those] words consists in our grasp of the way in 
which they may figure in sentences in general, and how, in general, they 
combine to determine the truth-conditions of those sentences.” (Dummett, 
1981, p. 5). This simple idea forms the basis of our strategy in designing an 
ontology for commonsense knowledge: what language allows one to say about a 
concept tells us a lot about the concept under consideration. In other words, 
the meanings of words (i.e., the concepts) can be discovered from the manner 
in which the words are used in everyday language. As Bateman (1995) has 
suggested, language is the best-known theory on everyday knowledge. Assuming 
that language reflects thought, therefore, analyzing patterns of everyday 
language ‘use’ should provide useful clues to the structure of commonsense 
knowledge. As a motivating example, consider the nouns table and elephant, 
and the adjectives smart and large, out of which four syntactically well-formed 
and semantically valid adjective-noun combinations can be made. One of these 
combinations, namely smart table, is typically rejected on pragmatic grounds, 
as it is at odds with our commonsense view of what tables are². In particular, 
while it is sensible to say large elephant and large table, a table is not the kind 
of object to which smart applies. This analysis results in the fragment 
hierarchy shown in figure 1 below. 




[Figure 1: the class {elephant, table} with subclasses {table} and {elephant}] 

Figure 1. A simple analysis of four adjective-noun combinations. 

Note that this kind of analysis is not much different from the type inferencing 
process that occurs in strongly typed, polymorphic programming languages. 
For example, consider the linguistic patterns and the corresponding type 
inferences shown in table 1. From x + 3, for example, one can infer that x is a 
number since numbers are the “kinds of things” that can be added to 3. In 
general, the most generic type possible is inferred (i.e., these operations are 
assumed to be polymorphic). 

² For the moment we are not concerned with metaphor. 



Linguistic Pattern    Type Inference
------------------    --------------------------------------------
x + 3                 x is a number
reverse(x)            x is a sequence
insert(x,y)           x is an object; y is a sequence of x objects
head(x)               x is a sequence
even(x)               x is a number

Table 1. Linguistic patterns and the corresponding type inferences. 
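
The inferences in Table 1 can be mimicked with a small lookup from each operation to the most generic type it imposes on its arguments. The following is only an illustrative sketch: the names `OPERAND_TYPES` and `infer_types` are introduced here for the example and do not come from the paper.

```python
# Illustrative sketch only (hypothetical names): each operation from Table 1
# is mapped to the most generic type it imposes on its arguments.
OPERAND_TYPES = {
    "+": ("number", "number"),         # x + 3       -> x is a number
    "reverse": ("sequence",),          # reverse(x)  -> x is a sequence
    "insert": ("object", "sequence"),  # insert(x,y) -> y is a sequence of x objects
    "head": ("sequence",),             # head(x)     -> x is a sequence
    "even": ("number",),               # even(x)     -> x is a number
}


def infer_types(operation, arguments):
    """Return {argument name: inferred type} for one usage pattern."""
    expected = OPERAND_TYPES[operation]
    if len(expected) != len(arguments):
        raise ValueError(f"{operation} expects {len(expected)} argument(s)")
    return dict(zip(arguments, expected))
```

For instance, `infer_types("reverse", ["x"])` says only that x is a sequence, which matches the point that the most generic type possible is inferred.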

For example, all that can be inferred from reverse(x) is that x is the generic 
type sequence, which could be a list, a string (a sequence of characters), a 
vector, etc. Note also that in addition to actions (methods), properties 
(truth-valued functions) can also be used to infer the type of an object. For example, 
from even(x) one can infer that x is a number, since lists, sequences, etc. are not 
the kinds of objects which can be described by the predicate even. Consider a 
set of concepts C and a set P of properties (or actions) that may or may not be 
sensibly applied to concepts in C: 

C = {list, string, set} 

P = {empty, memberOf, size, tail, head, reverse, toUpperCase} 

Shown in figure 2 below is a number of sets C_p = {c | app(p,c)} that are 
generated by the predicate app(p,c), which is true if the property p is applicable 
to the concept c (figure 2a); and the concept hierarchy implied by the subset 
relationship among these sets (figure 2b). Note that each (unique) set 
corresponds to a class in the hierarchy. Equal sets (e.g., C_tail and C_head) 
correspond to the same class. A class could be given any meaningful label that 
intuitively represents all the concepts in the class. For example, in figure 2b 
sequence is used to collectively refer to sets, strings and lists. 



(a) 
C_reverse = {list, string} 
C_tail = {list, string} 
C_head = {list, string} 
C_size = {list, set, string} 
C_memberOf = {list, set, string} 
C_empty = {list, set, string} 
C_toUpperCase = {string} 

(b) [Hierarchy diagram: a root class carrying empty, memberOf and size splits on tail, 
head and reverse into sequence (ordered) and an unordered branch (sets); string lies 
below sequence and carries toUpperCase.] 

Figure 2. Sets generated by app(p,c) and the hierarchical structure implied by them. 
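
As a rough sketch of how the sets in figure 2a and the classes in figure 2b might be computed, the code below builds C_p for every property and groups properties whose sets coincide. The applicability table is transcribed from the sets as they can be read from figure 2a; the function names are assumptions made for this example.

```python
# app(p, c) judgments as they can be read from figure 2a.
APPLICABLE = {
    "empty": {"list", "set", "string"},
    "memberOf": {"list", "set", "string"},
    "size": {"list", "set", "string"},
    "tail": {"list", "string"},
    "head": {"list", "string"},
    "reverse": {"list", "string"},
    "toUpperCase": {"string"},
}
CONCEPTS = {"list", "string", "set"}


def app(p, c):
    """True if property p is sensibly applicable to concept c."""
    return c in APPLICABLE[p]


def generated_sets(properties, concepts):
    """C_p = {c | app(p, c)} for every property p."""
    return {p: frozenset(c for c in concepts if app(p, c)) for p in properties}


def classes(sets_by_property):
    """Group properties with equal sets: each unique set is one class."""
    grouped = {}
    for p, members in sets_by_property.items():
        grouped.setdefault(members, set()).add(p)
    return grouped


SETS = generated_sets(APPLICABLE, CONCEPTS)
HIERARCHY_CLASSES = classes(SETS)
```

With these judgments the seven properties collapse into three classes, and the proper-subset ordering of their sets gives the parent/child structure of the hierarchy.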



Clearly, there are a number of rules that can be established from the concept 
hierarchy shown in figure 2. For example, one can state the following: 



$$(\forall c)\,(app(reverse, c) \supset app(size, c)) \tag{3}$$

$$(\exists c)\,(app(size, c) \land \lnot app(reverse, c)) \tag{4}$$

$$(\forall c)\,(app(tail, c) \equiv app(head, c)) \tag{5}$$

Here (3) states that whenever it makes sense to reverse an object c, then it also 
makes sense to ask for the size of c. This essentially means that an object to 
which the size operation can be applied must be a parent of an object to which 
the reverse operation can be applied. (4), on the other hand, states that there 
are objects for which the size operation applies, but for which the reverse 
operation does not apply. Finally, (5) states that whenever it makes sense to 
ask for the head of an object then it also makes sense to ask for its tail, and 
vice versa. Thus while there must be at least one property that defines a 
concept, there could be many (we will have more to say about this below). 
Finally, it must be noted that in performing this analysis we have assumed 
that the predicate app(p,c) is a Boolean-valued function, which has the 
consequence that the type hierarchy is a strict binary tree. In fact, this is one 
of the main characteristics of our method, and has led to two important 
results: (i) multiple inheritance is completely avoided; and (ii) by not allowing 
any ambiguity in the interpretation of app(p,c), lexical ambiguity, polysemy 
and metaphor are explicitly represented in the hierarchy. 
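
Rules (3) to (5) can be checked mechanically against the applicability judgments of the list/string/set example. The snippet below is a minimal sketch under that assumption; it treats app(p,c) as a Boolean-valued lookup, and the variable names are hypothetical.

```python
# Judgments for the four properties used in rules (3)-(5).
APPLICABLE = {
    "size": {"list", "set", "string"},
    "reverse": {"list", "string"},
    "tail": {"list", "string"},
    "head": {"list", "string"},
}
CONCEPTS = ["list", "string", "set"]


def app(p, c):
    """Boolean-valued applicability of property p to concept c."""
    return c in APPLICABLE[p]


# (3) for all c: app(reverse, c) implies app(size, c)
rule_3 = all((not app("reverse", c)) or app("size", c) for c in CONCEPTS)
# (4) there is some c with app(size, c) but not app(reverse, c)
rule_4 = any(app("size", c) and not app("reverse", c) for c in CONCEPTS)
# (5) for all c: app(tail, c) holds exactly when app(head, c) holds
rule_5 = all(app("tail", c) == app("head", c) for c in CONCEPTS)
```

Here the witness for (4) is the concept set, which has a size but cannot be reversed.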



3 Language and Commonsense Knowledge 

The work described here was motivated by the following two assumptions: (i) 
the process of language understanding is for the most part a commonsense 
reasoning process at the pragmatic level; and (ii) since children master spoken 
language at a very young age, children must be performing commonsense 
reasoning at the pragmatic level, and consequently, they must possess all the 
commonsense knowledge required to understand spoken language³. In other 
words, we are assuming that deciding on a particular app(v,c) should not be 
controversial, and that children can easily and consistently answer simple 
questions such as do elephants fly, do mountains talk, do books run, etc. Note 
that in answering these questions it is clear that one has to be conscious of 
metaphor. For example, while tables, people, and feelings can be strong (i.e., it 
is quite meaningful to say strong table, strong person, strong feeling), it is clear 
that the senses of strong in these three cases are quite distinct. In fact, the 
various metaphorical derivations of a lexeme are eventually discovered by the 
process we describe here, as will become evident in the next sections. The point 
here is that all that matters, initially, is to consider posing queries such as 
app(smart,elephant) to a five-year-old. Furthermore, in asking such a query we 
are not asking whether or not every elephant is smart, nor how smart elephants 
can be, but whether or not it is meaningful to say ‘smart elephant’. We believe 
that such queries are binary-valued. In other words, while at the quantitative 
(or data) level it could be a matter of degree as to how smart a specific 
elephant might be, for example, the qualitative question of whether or not it 
is meaningful to say ‘smart elephant’ is not a matter of degree⁴. With this in 
mind, our basic approach to discovering the ontology of commonsense 
knowledge can be summarized as follows: 

■ Select a set of adjectives and verbs, V = {v_1,...,v_m}. 

■ Select a set of nouns, C = {c_1,...,c_n}. 

■ Generate sets C_i = {c ∈ C | app(v_i,c)}, 1 ≤ i ≤ m, for every v_i ∈ V. 

■ Analyse the subset relationship between all sets C_i ∈ {C_1,...,C_m}. 

³ It may very well be the case that “everything we know we learned in kindergarten”! 

⁴ We will not dwell on this issue too much here except to say that as Elkan (1993) has 
convincingly argued, to avoid certain contradictions logical reasoning must at some level 
collapse to a binary logic. While Elkan’s argument seemed to be susceptible to some 
criticism (e.g., Dubois et al. (1994)), there are more convincing arguments supporting the 
same result. For example, consider the following: 

(1) John likes every famous actress 
(2) Liz is a famous actress 
(3) John likes Liz 

As an initial example, consider the set of verbs V = {move, walk, run, talk, 
reason} and the set of nouns C = {Rational, Bird, Elephant, Shark, Animal, 
Ameba}. Repeated application of app(v,c) results in the following sets: 

C_move = {Rational, Animal, Bird, Elephant, Shark, Ameba} 
C_talk = {Rational} 
C_reason = {Rational} 
C_think = {Animal} 
C_walk = {Rational, Bird, Elephant} 
C_run = {Rational, Bird, Elephant} 
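
The four steps of the procedure can be sketched in a few lines. The code below is an illustration under stated assumptions, not the implementation used for the experiment: the judgments are copied from the sets listed above for the five verbs in V (the set for think is left out because think is not among the verbs listed in V), and `discover` is a name introduced here.

```python
# Judgments app(v, c), restricted to the five verbs listed in V.
JUDGMENTS = {
    "move": {"Rational", "Animal", "Bird", "Elephant", "Shark", "Ameba"},
    "walk": {"Rational", "Bird", "Elephant"},
    "run": {"Rational", "Bird", "Elephant"},
    "talk": {"Rational"},
    "reason": {"Rational"},
}
V = ["move", "walk", "run", "talk", "reason"]
C = ["Rational", "Bird", "Elephant", "Shark", "Animal", "Ameba"]


def app(v, c):
    return c in JUDGMENTS[v]


def discover(verbs, nouns):
    # Step 3: C_i = {c in C | app(v_i, c)} for every verb v_i.
    sets = {v: frozenset(c for c in nouns if app(v, c)) for v in verbs}
    # Equal sets collapse into one class, labelled by the verbs generating it.
    label = {}
    for v, members in sets.items():
        label.setdefault(members, []).append(v)
    # Step 4: the parent of a class is its smallest proper superset
    # (ties, which a strict tree would not produce, are broken arbitrarily).
    parent = {}
    for s in label:
        supersets = [t for t in label if s < t]
        parent[s] = min(supersets, key=len) if supersets else None
    return label, parent


label, parent = discover(V, C)
```

On this data talk and reason label the same class, walk and run label its parent, and move labels the root.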

First we note that while some decisions could ‘technically’ be questioned (say 
by a biologist), our strategy was to simply consider the question from the point 
of view of commonsense. In deciding on a particular app(v,c) we considered the 
query posed to a five-year-old: do elephants fly, do they run, do they talk, etc. 
Questionable situations were simply ignored. This initial process resulted in the 
hierarchy shown in figure 3 below. Some of the sets indicating positive left and 
right attributes are given in figure 4 below. Note that some powerful inferential 
patterns that can be used in language processing are implicit in the structure 
shown in figure 3. For example, what does not think does not hurt (Lj), what 
walks also runs (Lg), anything that lives evolves (Lg and L^), etc. Note that 
according to our strategy every concept at the knowledge- (or commonsense-) 
level must ‘own’ some unique property, and this must also be linguistically 
reflected by some verb or adjective. This might be similar to what Fodor (1998, 
p. 126) meant by “having a concept is being locked to a property.” In fact, it 
seems that this is one way to test the demarcation line between commonsense 
and domain-specific knowledge. In particular, it seems that domain-specific 
concepts are not uniquely locked to any word in the language. 



4 Polysemy and Metaphor 

In our approach the occurrence of a verb/adjective at any place and at any 
level in the hierarchy always refers to a unique sense of that verb/adjective. 
Therefore one expects similar senses of a lexeme to apply to concepts along the 
same path, albeit at different levels in the hierarchy. In particular, one would 
expect highly ambiguous verbs to apply to concepts higher up in the 
hierarchy, where various similar senses of a verb v should end up applying at 
various levels below v. 

⁴ (continued) Clearly, (1) and (2) should entail (3), regardless of how famous Liz actually is. Using any 
quantitative model (such as fuzzy logic), this intuitive entailment cannot be produced (we 
leave the details of formulating this in fuzzy logic as an exercise!) The problem here is that 
at the qualitative level the truth-value of famous(x) must collapse to either true or false, 
since at that level all that matters is whether or not Liz is famous, not how famous she 
actually is. 





[Figure 3: concept hierarchy diagram; legible node labels include LivingThings, 
InAnimate, Artifacts and Constructions] 

Figure 3. An adult is a physical, living thing that is formed. It evolves, it grows, it 
develops, moves, it can walk, run, hear, see, talk, think, reason, etc. 



Consider for example form and formulate, in the sense of forming and 
formulating ideas. Since our method is based on the idea of using such verbs to 
discover the nature of concepts, form and formulate must both apply to ideas. 
Note that if everything that can be ‘formed’ can be ‘formulated’ and vice versa, 
then these two verbs would be synonymous. However, in this case this is not 
so, since there are things that can be formed but not formulated. For example, 
consider the small fragment shown in figure 5, where it is shown that 
‘developing’, ‘formulating’, ‘forming’, etc. are all specific ways of ‘making’ (in 
other words, one sense of ‘make’ is ‘develop’). Note the eventual split, however. 
In particular, while we make, form, and develop both ideas and feelings, 
ideas are formulated while feelings are fostered. 

While the occurrence of similar senses of verbs at various levels in the 
hierarchy indicates polysemy, the occurrence of the same verb (the same 
lexeme) at structurally isomorphic places in the hierarchy indicates 
metaphorical derivations. Consider the following: 



app(run, LeggedThing)    (6) 
app(run, Machine)        (7) 
app(run, Show)           (8) 

[Figure 5: the class {ideas, feelings} splits into feelings (+foster) and ideas (+formulate)] 

Figure 5. An explanation of polysemy. 

(6) through (8) state that we can speak of a legged thing, a machine and a 
show running. Clearly, however, these examples involve three different senses 
of the verb run. It could be argued that the senses of run that are implied by 
(7) and (8) correspond to a metaphorical derivation of the actual running of 
natural kinds, the sense implied by (6). It is also interesting to note that these 
metaphorical derivations occur at various levels: first from natural kinds to 
artifacts; and then from physical to abstract. Moreover, the mass/count 
distinction on the physical side seems to have a mirror image in a mass/count 
distinction on the abstract side. For example, note the following similarity between water 
(physical substance) and information (abstract substance): 



■ water/information flows, can be diverted, filtered, processed, etc. 

■ we can be flooded by, or drown in water/information 

■ a little bit of water/information is (still) water/information 



An interesting direction suggested by these findings is to investigate further the exact 
nature of this metaphorical mapping and whether the map is consistent 
throughout; that is, whether same-level hierarchies are structurally isomorphic, 
as appears to be the case so far (see figure 6)⁵. 



[Figure 6: two parallel hierarchies. Animal > LeggedLivingThing > WingedLeggedLivingThing; 
Machine > WheeledMachine > WingedWheeledMachine] 

Figure 6. Isomorphic structures explaining metaphors. 



6 Negation, Immutable Features and Surprise 

The model proposed here allows us to have a very interesting model of 
negation. To illustrate, consider the following propositions implied by the 
concept hierarchy given in figure 3, namely that generally animals move, and 
people talk: 

app(Move, Animal)    (9) 
app(Talk, Rational)  (10) 

⁵ Conservatively, the mapping might be a homomorphism and not an isomorphism. 

What is interesting to consider here is how one interprets the negation of such 
propositions. In particular, there are two possible answers to the query 
¬app(Move,?X), i.e., to the query “what objects do not move?” One can simply 
provide (U - Animal) as an answer, where U is the set of all concepts in the 
universe of discourse. This is the set of all concepts excluding those for which 
app(Move, Animal) holds. Thus, plants and all non-living things do not move 
(see figure 7 below). This is strong negation, since it simply returns the 
complement with respect to the entire structure. However, we argue that there 
is a subtle difference between the following queries: 

Do mountains talk? (11) 

Do elephants talk? (12) 

Although a rational agent would answer “no” in both cases, one might imagine 
a child replying “nah, mountains do not talk” in response to (11). This must be 
a function of the following: elephants fall directly under the negative polarity of 
talk, while this property is not even applicable to mountains (see figure 7 
below). From a Gricean point of view, it seems that “elephants do not talk” is 
somewhat more meaningful than “mountains do not talk.” This subtle 
difference in the two cases of negation is crucial in performing commonsense 
reasoning in language understanding. This is also related to the notion of the 
immutability of a feature (Sloman et al., 1998), which is thought to reflect the 
degree to which a concept depends on a certain feature (or, conversely, how 
central a certain feature is to the definition of a concept). 



[Figure 7: hierarchy fragment with nodes Thing, PhysicalThing, AbstractThing and 
LivingThing; one branch is marked ¬talk(x)] 

Figure 7. Mountains, elephants, and poems do not talk. 
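
The difference between the two kinds of "no" can be sketched over a toy hierarchy. Everything in this example is an assumption made for illustration: the `PARENT` table, the choice of Animal as the class within which talk is decided, and the function names are hypothetical and are not taken from figure 3 or figure 7.

```python
# Hypothetical fragment of a hierarchy: child -> parent.
PARENT = {
    "Thing": None,
    "PhysicalThing": "Thing",
    "AbstractThing": "Thing",
    "LivingThing": "PhysicalThing",
    "Animal": "LivingThing",
    "Rational": "Animal",
    "Elephant": "Animal",
    "Mountain": "PhysicalThing",
    "Poem": "AbstractThing",
}

# Assumed split point: talk is decided within Animal, positively for Rational.
SPLIT = {"talk": ("Animal", "Rational")}


def isa(concept, kind):
    """True if kind is concept itself or one of its ancestors."""
    while concept is not None:
        if concept == kind:
            return True
        concept = PARENT[concept]
    return False


def answer(prop, concept):
    scope, positive = SPLIT[prop]
    if isa(concept, positive):
        return "yes"
    if isa(concept, scope):
        return "no"              # falls under the negative polarity of prop
    return "not applicable"      # prop is not even applicable to the concept
```

A rational agent would report both of the last two outcomes as "no"; the sketch only keeps them apart so that the distinction drawn in the text is visible.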



7 Reasoning with Commonsense Knowledge 

What we are suggesting in this paper is a process that would hopefully lead to 
the discovery of the ontology of commonsense knowledge. This alone would 
clearly do little towards building natural language understanding systems unless an 
inferencing strategy that utilizes this ontology is properly formulated. While 
the ontology provides the synthetic knowledge that an NLU system might 
need, an NLU system must clearly use quite a bit of analytic knowledge. A 
typical example would be the following: 

$$(\forall p, c_1, c_2)\,(app(p, c_1) \land isa(c_2, c_1) \supset app(p, c_2)) \tag{13}$$

That is, any property that applies to a concept applies to all its subtypes. 
Clearly there are numerous other such rules that could be added. Another 
important observation here is that the system that will eventually emerge will 
yield a much richer type structure than the (flat) type systems typically 
assumed in formal semantics (e.g., Montague, 1973). For example, from the 
hierarchy in figure 3 one can clearly establish the following: 

$$walk : (e_{LeggedThing} \to t) \tag{14}$$

$$write : (e_{Human} \to (e_{Content} \to t)) \tag{15}$$

That is, ‘write’ is not simply a relation between two entities, but a relation 
between two specific types of entities. Note the importance of this step (of 
combining formal semantics with a rich type hierarchy), however. For example, 
(15) states that write(x,y) is well-typed as long as isa(x,Human) and 
isa(y,Content). In general, 

$$wellTyped(v(e)) =_{df} type(v, (e_m \to t)) \land type(e, a) \land isa(a, m)$$

$$wellTyped(v(e_1, e_2)) =_{df} type(v, (e_m \to (e_n \to t))) \land type(e_1, a) \land type(e_2, b) \land isa(a, m) \land isa(b, n)$$
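
The well-typedness conditions can be sketched as a check of each argument's type against the verb's signature through isa. The hierarchy fragment, the concept Letter and the names `SIGNATURES` and `well_typed` below are assumptions introduced for this illustration; only the signatures of walk and write follow (14) and (15).

```python
# Hypothetical hierarchy fragment: child -> parent.
PARENT = {
    "Thing": None,
    "LeggedThing": "Thing",
    "Human": "LeggedThing",
    "Content": "Thing",
    "Letter": "Content",
}

# Verb signatures in the spirit of (14) and (15).
SIGNATURES = {
    "walk": ("LeggedThing",),
    "write": ("Human", "Content"),
}


def isa(a, m):
    """True if type a is m or a descendant of m."""
    while a is not None:
        if a == m:
            return True
        a = PARENT[a]
    return False


def well_typed(verb, *arg_types):
    """v(e_1, ..., e_k) is well-typed if every argument type is a subtype
    of the type the verb expects in that position."""
    expected = SIGNATURES[verb]
    return len(expected) == len(arg_types) and all(
        isa(a, m) for a, m in zip(arg_types, expected)
    )
```

Because Human is placed below LeggedThing in this fragment, `well_typed("walk", "Human")` holds, which is the effect of rule (13): what applies to a concept applies to its subtypes.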



More importantly, however, the combination of a rich type hierarchy and a 
rigorous semantics should shed some light on the semantics of compound 
nominals. For instance, type information might explain why removing the 
middle noun from (16) changes the subject considerably while the same is not 
true in (17). 

Computer book sale (16) 

Information management system (17) 

Such rules are important in a variety of language processing tasks, and in 
particular in topic-based information retrieval. A compositional semantics that 
exploits a rich type hierarchy should therefore facilitate the development of a 
meaning algebra; for example to explain why fake gun is not (exactly) a gun, 
whereas imported gun is very much a gun. These are precisely the kinds of 
issues that have prompted this work, and much of this is currently under 
development. 



8 Concluding Remarks 



In this paper we argued for and presented a new approach to the systematic 
design of ontologies of commonsense knowledge. The method is based on the 
basic assumption that “language use” can guide the classification process. This 
idea is in turn rooted in Frege’s principle of Compositionality and is similar to 
the idea of type inference in strongly-typed, polymorphic programming 
languages. The experiment we conducted shows this approach to be quite 
promising as it seems to have answered a number of questions simultaneously. 
In particular, the approach seems to (i) completely remove the need for 
multiple inheritance; (ii) provide a good model for lexical ambiguity and 
polysemy; and (iii) suggest a plausible explanation of metaphor in natural 
language. Much of what we presented here is work in progress, more so than a 
final result. Therefore, we are well aware that it might be quite ambitious to 
expect this process to yield a complete classification in a strict binary tree (no 
multiple inheritance, and no lexical ambiguity). We must also note that a 
number of other aspects of this work were not discussed here, such as the 
part-whole relationship. In particular, it seems that some, but not all, verbs that 
apply to a concept apply to its parts. For example, grow in app(grow,leg) and 
app(grow,arm) is very much related to grow in app(grow,person). That is, when 
we refer to a person growing, aging, etc. we are indirectly referring to the 
growing or aging of the parts. Another important part of this work is to also 
discover the nature of the relationship between (genuine) types (e.g., Human) 
and roles that concepts play (e.g., Teacher, Father, etc.). In this regard a 
number of temporal aspects must also be formalized. 

A great deal of work is still needed to formalize the entire approach as well 
as work out the various inference rules that will eventually be needed in a 
natural language understanding system. We have already successfully used 
some of the ideas presented here in NLU tasks, such as developing an efficient 
and cognitively plausible inferencing strategy to resolve quantifier scope 
ambiguities at the pragmatic level (see Saba & Corriveau, 2001). While our 
immediate goal is to discover the ontology of commonsense knowledge, our 
ultimate goal is to build systems that can understand spoken language. This 
task has proven to be more challenging than has ever been imagined. Turing 
might have had it right all along: a machine that can converse in spoken 
language must be an intelligent machine! 



Abstract. An IDS (Intrusion Detection System) plays a vital role in network security 
in that it monitors system activities to identify unauthorized use, misuse or 
abuse of computer and network systems. For the simulation of IDS, a model has 
been constructed based on the DEVS (Discrete EVent system Specification) 
formalism. With this model we can simulate whether intrusion detection, 
which is a core function of IDS, is effectively done under various conditions. 
As intrusions become more sophisticated, it is beyond the scope of any 
one IDS to deal with them. Thus we placed multiple IDS agents in the network, 
where the information helpful for detecting the intrusions is shared among these 
agents to cope effectively with attackers. The agents cooperate through the 
BBA (Black Board Architecture) to detect intrusions. If an agent detects 
an intrusion, it transfers the attacker’s information to a Firewall. Using this 
mechanism, the attacker’s packets detected by the IDS can be prevented from damaging the 
network. 



1 Introduction 

As e-business develops rapidly, the importance of network security is on the 
rise [1],[2]. An IDS monitors system activities to identify unauthorized use, misuse or 
abuse of computer and network systems [3],[4]. It accomplishes this by collecting 
information from a variety of systems and network resources and then analyzing the 
information for symptoms of security problems [4],[5]. 

Usually, the input data in a simulation is abstracted from the actual intrusion. In this 
paper, however, we compose a real intrusion environment by generating non-abstracted 
intrusion packets and, accordingly, a non-abstracted version of the IDS core. Another 
characteristic of the proposed simulation is the modeling of multiple IDSs which 
share attackers’ information to detect intrusions effectively. 



2 DEVS Formalism 

The DEVS formalism, developed by Zeigler, is a theoretically well-grounded means of 
expressing hierarchical, modular discrete-event models [6],[7],[8],[9]. In DEVS, a 
system has a time base, inputs, states, outputs and functions. The system functions 
determine the next states and outputs based on the current states and inputs. In the 
formalism, a basic model is defined by the structure: 

$$M = \langle X, S, Y, \delta_{int}, \delta_{ext}, \lambda, ta \rangle$$

where X is an external input set, S is a sequential state set, Y is an external output 
set, $\delta_{int}$ is an internal transition function, $\delta_{ext}$ is an external transition function, 
$\lambda$ is an output function and ta is a time advance function. A coupled model is defined by the 
structure: 

$$DN = \langle D, \{M_i\}, \{I_i\}, \{Z_{i,j}\}, select \rangle$$

where D is a set of component names, $M_i$ is a component basic model, $I_i$ is the set of 
influencees of component i, $Z_{i,j}$ is an output translation, and select is a tie-breaking 
function. A coupled model thus specifies how several basic models are coupled together 
to form a new, more complex model, and such a coupled model can itself be employed 
as a component in a larger coupled model. 
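
To make the basic-model structure concrete, the following is a minimal sketch of an atomic DEVS model: a single-job processor of the kind commonly used to introduce the formalism. It is not one of the IDS component models of this paper; the class name, the two phases and the service time are assumptions chosen for illustration.

```python
import math


class Processor:
    """Atomic DEVS model M = <X, S, Y, delta_int, delta_ext, lambda, ta>.

    X: incoming jobs, Y: finished jobs, S: phase ("idle" or "busy") plus
    the job in service and the time remaining until it is finished.
    """

    def __init__(self, service_time=2.0):
        self.service_time = service_time
        self.phase = "idle"
        self.job = None
        self.remaining = math.inf

    def ta(self):
        # Time advance: how long the model stays in its current state.
        return self.remaining

    def delta_ext(self, elapsed, x):
        # External transition: input x arrives `elapsed` time units after
        # the last transition.
        if self.phase == "idle":
            self.phase, self.job, self.remaining = "busy", x, self.service_time
        else:
            # Already busy: x is ignored, only the remaining time is updated.
            self.remaining -= elapsed

    def output(self):
        # Output function (lambda), evaluated just before delta_int fires.
        return self.job

    def delta_int(self):
        # Internal transition: fires when ta() has elapsed.
        self.phase, self.job, self.remaining = "idle", None, math.inf
```

A simulator would repeatedly advance time by `ta()`, emit `output()` and call `delta_int()`, interrupting with `delta_ext()` whenever an input arrives earlier.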



3 Classification of Intrusion 

The intrusions used in the simulation are classified into three types according to the 
number of packets needed to detect the intrusion, as shown in Table 1. The first type 
can be identified by analyzing a single packet which contains one or more abnormal flags 
in its header information, and the second type by analyzing many packets, as in 
DoS (Denial of Service). The third type can be identified by analyzing the packets’ data, in 
which the attacker tries to acquire the privileges of the system administrator using bugs in the 
system [10]. 



Table 1. Classification of Intrusions 

Attack types, by the number of packets needed for detection and the part of the 
packet that is analyzed: 

One packet needed for detection (packet header analyzed) 
  -probing: ■port probing ■protocol probing 
  -DoS: ■winnuk ■X-mas tree ■ping of death ■land attack 

Multiple packets needed for detection (packet headers analyzed) 
  -scan: ■port scan ■address scan ■ftpd scan ■CGI-Query ■mscan ■sscan 
         ■popd scan ■imapd scan 
  -DoS: ■ICMP flood ■web-port DoS ■mail-port DoS ■DNS DoS ■mailbomb 
        ■UDP bomb ■SYN Flood ■smurf ■trinoo ■spam mail 

Multiple packets needed for detection (packet data analyzed) 
  ■door knob rattling 
  ■buffer overrun 
  ■password cracking 
  ■environment variable overflow attack 
  ■use commands which can be used by administrator himself 
  ■use commands which are frequently used by intruders 



4 The Structure of Target Network and Simulation Model 

Fig. 1. The structure of target network 

Fig. 1 shows the structure of the target network, which has three subnets. The types of 
component models in the network are IDS, Firewall, Router and Gateway models [11]. 
An IDS is loaded within each host and it cooperates with other IDSs in detecting 
intrusions. 

Fig. 2 shows the structure of the model that is based on the network described in 
Fig. 1. The model is constructed based on the DEVS formalism. Each subnet has several 
ID models. Fig. 3 shows the structure of the ID model within each host, its subcomponents 
and their interconnections. The subcomponent models are explained in the 
following subsections. 





Fig. 2. The overall structure of network model 




Fig. 3. The structure of intrusion detection model 



4.1 PCL Model 

The PCL (Packet Classify Library) model receives the network packets generated by the intrusion generator model and classifies them according to Table 1. It then filters the sorted packets to reduce processing time. In the mailbomb case, for example, the TA (Task Allocator) of the PCL model receives packets from the generator model and transmits them to one of three types of models: MTONE, MTTWO and MTTHREE. If the packets sent to MTTWO are TCP packets with port number 25, MTTWO transfers them to an agent model for further processing. Otherwise, MTTWO ignores them.
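The MTTWO filtering step for the mailbomb case can be sketched as follows. This is only an illustration in Python, not the DEVS model used in the paper: the `Packet` class, its field names and the function `mttwo_filter` are assumptions introduced here. The protocol number 6 (TCP) and port 25 are the values tested in Fig. 5.

```python
from dataclasses import dataclass

TCP = 6          # IP protocol number for TCP, as tested in Rule1 of Fig. 5
SMTP_PORT = 25   # mail port checked for the mailbomb case


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet record; the field names are illustrative."""
    source_ip: str
    protocol: int
    dest_port: int


def mttwo_filter(packets):
    """Keep only TCP packets addressed to port 25; all others are ignored."""
    return [p for p in packets if p.protocol == TCP and p.dest_port == SMTP_PORT]


packets = [
    Packet("10.0.0.5", TCP, 25),
    Packet("10.0.0.5", 17, 25),   # UDP, ignored
    Packet("10.0.0.7", TCP, 80),  # web traffic, ignored
]
forwarded = mttwo_filter(packets)  # only the first packet reaches the agent model
```

Filtering before inference keeps the fact base of the expert system small, which is the reduction in processing time the text refers to.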

Fig. 4 shows the model state transition diagram of several models. 






δext : external-transition function
δint : internal-transition function



Fig. 4. Model state transition diagram 



4.2 AGENT Model 

The agent model is a rule-based ES (Expert System) that plays the core role in detecting intrusions. It transforms the packets delivered by the PCL model into facts for the ES, and the ES performs inference on the facts thus generated. If a new attack is to be added to the ID model later on, the administrator classifies the attack according to Table 1, adds a suitable subcomponent model to the PCL model, and adds the corresponding rules to the agent model.



void MBRule::Rule1(Slot_List& fact) {
    ... if (protocol == 6) PtlId = true; ... }

void MBRule::Rule2(Slot_List& fact) {
    ... if (PtlId && port == 25) PrtId = true; ... }

void MBRule::Rule3(Slot_List& fact) {
    ... if (PrtId && Time == nowtime)
            if (timecount >= localThreshold) {
                S_add.insert(Source_IP);
                InAttack = true; ... }
    ... }

void MBRule::Rule4(Slot_List& fact) {
    ... // check the buffer clearing time. ... }

void MBRule::Rule5(Slot_List& fact) {
    ... if (!BufCT && IsDanger() >= Minimal)
            InMin = true; ... }

void MBRule::Rule9(Slot_List& fact) {
    ... if (InSer && IsDanger() >= Catastrophic)
            InCat = true; ... }

Fig. 5. The rules of Expert System 

Fig. 5 shows part of the rules for the mailbomb attack. Rule3, for example, receives the facts and checks whether PtlId is true and whether the value of the fact “Time” indicates that the network is currently under attack. If the number of packets reaches “localThreshold”, Rule3 stores “Source_IP” and sets the variable “InAttack” to true. Rule4 checks the periodic clearing time of the buffers.
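The chain of Rule1 to Rule4 can be approximated by the sketch below. It is a simplified illustration under stated assumptions, not the expert system of the paper: the class `MailbombRules`, the per-source counting and the reset of counters whenever the time tick changes are all choices made here, since Fig. 5 elides how `timecount` is maintained.

```python
from collections import Counter


class MailbombRules:
    """Counts mail packets per time tick and flags sources over a threshold."""

    def __init__(self, local_threshold):
        self.local_threshold = local_threshold
        self.current_time = None
        self.counts = Counter()   # packets seen per source in the current tick
        self.suspects = set()     # plays the role of S_add in Fig. 5
        self.in_attack = False

    def on_fact(self, source_ip, protocol, port, time):
        if protocol != 6:         # Rule1: TCP only
            return
        if port != 25:            # Rule2: mail port only
            return
        if time != self.current_time:
            # Assumed stand-in for the periodic buffer clearing of Rule4.
            self.current_time = time
            self.counts.clear()
        self.counts[source_ip] += 1
        if self.counts[source_ip] >= self.local_threshold:   # Rule3
            self.suspects.add(source_ip)
            self.in_attack = True


rules = MailbombRules(local_threshold=3)
for _ in range(3):
    rules.on_fact("10.0.0.5", protocol=6, port=25, time=100)
rules.on_fact("10.0.0.9", protocol=6, port=25, time=100)
```

After these four facts only the first source has reached the threshold, so it alone is recorded as a suspect.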



5 The Collaboration among the Security Models 



5.1 Communication among Agents of BB (Black Board) 




Fig. 6. Communication by Black Board 

A large volume of research has been done on detecting intrusions within distributed environments [12],[13],[14]. This section presents the mechanism by which IDSs communicate through the BBA, one of the architectures that allow collaboration among distributed agents [15],[16],[17].

The BB (Black Board) in the BBA, which comes from the field of distributed AI (Artificial Intelligence), is a hierarchically structured shared working memory through which the agents communicate by writing and reading information relevant to detecting intrusions.

The hierarchy in the BB is set according to Joseph Barrus & Neil C. Rowe [18], as shown in Fig. 6. They proposed dividing Danger values into five levels, and the levels in the BB are based on these divisions: Minimal, Cautionary, Noticeable, Serious and Catastrophic.




Messages: write_request, write_permit, write_end, read, read_end
Actions: write, read


Fig. 7. Message of IDS and BBA in mailbomb attack 

Each agent communicates using two types of messages: control messages and data messages. Since the agents insert intrusion-related information into the BB, each agent must request write permission from the controller so that consistency and contention problems can be managed. After writing is done, the agent sends a write_end message to the controller, and the controller reports this event to the other IDSs. The IDSs, once they have read the necessary information from the BB, send a read_end message to the controller. As an example, the transactions involved in the mailbomb case are shown in Fig. 7.
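The message sequence described above can be sketched as a small controller. This is an illustrative Python sketch, not the simulation model of the paper: the class `BlackboardController`, its method names (which mirror the message names in Fig. 7) and the single-writer policy are assumptions made here.

```python
class BlackboardController:
    """Serialises writes to a shared blackboard and tracks pending readers."""

    def __init__(self, agent_ids):
        self.agent_ids = set(agent_ids)
        self.entries = []           # shared working memory
        self.writer = None          # agent currently holding write permission
        self.pending_reads = set()  # agents notified but not yet finished reading
        self.log = []               # control messages, in order

    def write_request(self, agent):
        if self.writer is not None:
            return False            # another agent is writing: permission denied
        self.writer = agent
        self.log.append(("write_permit", agent))
        return True

    def write(self, agent, info):
        if agent != self.writer:
            raise PermissionError("write without permission")
        self.entries.append((agent, info))

    def write_end(self, agent):
        if agent != self.writer:
            raise PermissionError("write_end from a non-writer")
        self.writer = None
        self.pending_reads = self.agent_ids - {agent}   # agents to be notified
        self.log.append(("write_end", agent))

    def read(self, agent):
        return list(self.entries)

    def read_end(self, agent):
        self.pending_reads.discard(agent)
        self.log.append(("read_end", agent))


bb = BlackboardController(["ids1", "ids2", "ids3"])
granted = bb.write_request("ids1")
denied = bb.write_request("ids2")   # refused while ids1 holds the permit
bb.write("ids1", {"attack": "mailbomb", "source_ip": "10.0.0.5"})
bb.write_end("ids1")
for other in ("ids2", "ids3"):
    seen = bb.read(other)
    bb.read_end(other)
```

Granting one write permit at a time is one simple way to address the consistency and contention problems the text mentions.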



5.2 Cooperation between IDS and Firewall

The IDS and the Firewall, being the major components of network security, cooperate to enhance the security level. If an IDS detects an intrusion through the BBA, its agent modifies the security policy of the Firewall so that the intrusion packets detected by the IDS can be blocked.

Simulation of Network Security with Collaboration among IDS Models

In order to keep damage to the network to a minimum, we prevent the attacker's packets from reaching the computer when the BB is at or beyond the Serious level. When the BB reaches the Serious level, the agent adds the source IP (Internet Protocol) address to the blacklist of the Firewall, and all packets coming from that source IP address are then blocked. Fig. 8 shows an IDS detecting an intrusion and responding to the attack.
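The response step can be sketched as below. This is a minimal Python illustration: the level names come from Section 5.1, while the numeric ordering, the `Firewall` class and the `respond` function are assumptions introduced for the example.

```python
from enum import IntEnum


class Danger(IntEnum):
    """The five BB levels; the numeric ordering is an assumption."""
    MINIMAL = 1
    CAUTIONARY = 2
    NOTICEABLE = 3
    SERIOUS = 4
    CATASTROPHIC = 5


class Firewall:
    def __init__(self):
        self.blacklist = set()

    def allows(self, source_ip):
        return source_ip not in self.blacklist


def respond(firewall, level, source_ip):
    """Blacklist the source once the blackboard level is Serious or above."""
    if level >= Danger.SERIOUS:
        firewall.blacklist.add(source_ip)


fw = Firewall()
respond(fw, Danger.NOTICEABLE, "10.0.0.9")   # below Serious: no change
respond(fw, Danger.SERIOUS, "10.0.0.5")      # Serious: source is blocked
```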




Fig. 8. Intrusion detection and response 



6 Simulation Result 

We have executed simulations for two cases: in one, a single IDS detects the intrusion; in the other, multiple IDSs detect the intrusion cooperatively. The mailbomb attack was used in both cases. A mailbomb attack is a type of DoS attack that works by sending a large number of mails to the mail server. Kaboom version 3.0 was used to generate the mailbomb packets. The intrusion detection time, the false positive ratio and the false negative ratio were measured as performance indexes in the simulation.

Fig. 9. Intrusion detection time



As shown in Fig. 9, the Serious threshold values selected for the simulation are 40, 60 and 80. The multiple IDSs detect the intrusion faster than the single IDS does for all of the threshold values. The faster the intrusion is detected, the earlier the administrators can respond to it. It is important for the safety of the network that the network administrator responds at an early stage of the intrusion.




[Plot: horizontal axis Serious Threshold (0 to 100); series: single IDS, multi IDS]



Fig. 10. False positive ratio 

Fig. 10 shows that the false positive ratio increases as the security level is strengthened (that is, as the threshold is lowered in our system). Fig. 11 shows that the false negative ratio decreases as the security level is strengthened. In the figure, the error ratio of the multi IDS is lower than that of the single IDS, since the intrusions are detected on the basis of shared information.




[Plot: horizontal axis Serious Threshold (0 to 100); series: single IDS, multi IDS]



Fig. 11. False negative ratio 



7 Conclusion and Future Work 

As network usage increases, intrusions occur more frequently and become more widespread and sophisticated. If multiple agents share information with one another, the detection capability can be enhanced. A system that uses the BBA for information sharing can be easily expanded by adding new agents and increasing the number of BB levels. The cooperation between the Firewall component and the IDS provides added efficiency in safeguarding the network. In the future, the generator model should generate intrusion packets that resemble real-world packets in more detail, and the simulation environment should also provide a proper set of threshold values for the specific target system being modeled. The simulation results in this paper show that the false positive ratio of the multi IDS is worse than that of the single IDS, and that the false negative performance improves as the Serious threshold level is lowered. The false positive ratio of the multi IDS became worse as the same threshold level was lowered. The false positive performance of the multi IDS therefore needs improvement in further research on the IDS core. The current simulation results nevertheless show a performance improvement in the multi IDS case, since, if only one of the performance indexes can be enhanced, from a security standpoint it should be the false negative ratio rather than the false positive ratio.

Abstract. Over-the-telephone Large Vocabulary Spoken Dialog Systems have now become a commercial reality. A major obstacle to the uptake of the technology is the effort required to construct spoken dialog applications, in particular the grammars. To overcome this obstacle, a spoken dialogue toolkit has been developed that uses grammatical inference in combination with a templating technique to build transaction based services. As part of this development a new grammatical inference technique known as the "Lyrebird" algorithm has been developed. The experimental results presented show that the Lyrebird algorithm outperforms the only other known algorithm for inferring context free attribute grammars. We also present the results of a comparison between the performance of the Lyrebird algorithm and an experienced speech application developer, showing that the algorithm creates grammars of a similar quality in a significantly reduced time.



1 Introduction 

Developers of large vocabulary spoken dialog systems have several commercially available speech recognition products at their disposal. These products are available as stand-alone applications or are integrated into Interactive Voice Response platforms for over-the-telephone applications. To build an application, the developer is required to specify the expected language and the dialog.

The expected language is typically defined using a set of attribute grammars, with one grammar per question in the dialog. Attribute grammars [1] are used in large vocabulary speech recognition to

1) improve speech recognition accuracy by constraining the expected language;

2) attach meanings to phrases by attaching a set of key-value pairs to a parsed phrase.

For instance, the expression

i'd like to fly from melbourne to sydney

might be represented by the attributes

{ op=bookflight from=melbourne to=sydney }

Commercially available speech recognition development systems typically come with drag and drop GUI tools to enable the user to develop the dialogs. Developers use a graphical language based upon procedural flow to define the application. Although this style of programming has proven successful for tone based applications, it is unsuitable for natural language applications due to the large amount of branching in mixed mode initiative spoken dialogues. The drag and drop interface in its current form is starting to be replaced by finite state dialog managers due to the wide acceptance of the VoiceXML standard. The VoiceXML standard has been designed as a means of accessing Internet content using speech recognition.

In this paper, we describe a toolkit that uses a radically different concept for developing spoken dialog systems. This toolkit starts from a simple description of the goals of the application and then learns from examples to improve the interface. We will also describe the grammatical inference algorithm in more detail. The performance of the algorithm is then compared to the Bayesian Model Merging algorithm [2], [3]. Finally, the results of a comparison between the Lyrebird algorithm and an experienced human application developer are presented. The results suggest that the use of grammatical inference can significantly reduce development time without sacrificing grammar quality.

1.1 Attribute Grammars 

Each symbol in an attribute grammar (terminal or nonterminal) has a fixed number of attributes with corresponding values. Attribute grammars contain copy rules which are attached to context free production rules, and assign an attribute value or a constant to another attribute.

Figure 1 below shows a simple attribute grammar capable of being inferred using the Lyrebird algorithm. The notation used in this paper is as follows: symbols beginning in upper case are nonterminals while those beginning in lower case are terminals. Top level nonterminals begin with a period.

In our notation a nonterminal "Y1" can have a variable "x" attached to it, denoted as "Y1:x".

.S -> i'd like to fly S2:s { from=$s.from, to=$s.to, op=bookflight }

S2 -> from Location:l { from=$l.location }

S2 -> from Location:l1 to Location:l2 { from=$l1.location to=$l2.location }

Location -> melbourne { location=melbourne }

Location -> sydney { location=sydney }

Fig. 1. Example attribute grammar. 

The attribute values returned by a rule are defined using copy rules that are contained within curly braces. These copy rules can assign a value to the attribute returned by the rule either as a constant (e.g. location=melbourne) or by referencing the values of attributes attached to nonterminals (e.g. from=$l.location). In addition, the value of an attribute can be set to the result of a function that takes one or more arguments that are either constants or the value attached to a nonterminal (e.g. number=add($n1.number $n2.number)).

Methods for attaching attribute values to a phrase given an attribute grammar are 
described in [4] . In our algorithm we used a bottom up chart parser to create a syntax 
tree, and a parse stack to attach attributes to the phrase. 
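To make the role of copy rules concrete, the sketch below evaluates the synthesised attributes of the grammar in Figure 1. It is an assumption-laden illustration: it uses a small hand-written recursive-descent recogniser rather than the bottom up chart parser and parse stack described above, and the function names are invented for the example.

```python
LOCATIONS = {"melbourne", "sydney"}   # Location -> melbourne | sydney


def parse_location(tokens, i):
    """Location -> <city> { location=<city> }"""
    if i < len(tokens) and tokens[i] in LOCATIONS:
        return {"location": tokens[i]}, i + 1
    return None


def parse_s2(tokens, i):
    """S2 -> from Location:l1 [to Location:l2] (covers both S2 rules)."""
    if i >= len(tokens) or tokens[i] != "from":
        return None
    first = parse_location(tokens, i + 1)
    if first is None:
        return None
    l1, i = first
    attrs = {"from": l1["location"]}      # from=$l1.location
    if i < len(tokens) and tokens[i] == "to":
        second = parse_location(tokens, i + 1)
        if second is None:
            return None
        l2, i = second
        attrs["to"] = l2["location"]      # to=$l2.location
    return attrs, i


def parse_s(phrase):
    """.S -> i'd like to fly S2:s { from=$s.from, to=$s.to, op=bookflight }"""
    tokens = phrase.split()
    prefix = ["i'd", "like", "to", "fly"]
    if tokens[:len(prefix)] != prefix:
        return None
    result = parse_s2(tokens, len(prefix))
    if result is None or result[1] != len(tokens):
        return None
    attrs = {"op": "bookflight"}          # constant assigned by the top-level rule
    attrs.update(result[0])               # copy from/to up from S2
    return attrs


meaning = parse_s("i'd like to fly from melbourne to sydney")
```

Each function returns the attributes of its nonterminal, so values flow bottom up from `Location` through `S2` to `.S`, which is the sense in which the attributes are synthesised.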

Attribute copy rules can also be used to constrain the ways in which rules can be expanded. For instance, when a noun phrase is attached to a verb phrase in a valid English sentence, they should either both be singular or both be plural. This interrelationship can be implemented using attribute copy rules that are defined top down rather than bottom up. These top down attributes are known as inherited attributes, while those described in the notation used in this paper are known as synthesised attributes and are defined bottom up.

Inherited attributes are useful for language generation systems. Although they can be used to disambiguate complex phrase structures, they are not commonly used in speech recognition systems, and therefore are not considered in this research.



1.2 Inferring Attribute Grammars 

The grammatical inference of attribute grammars involves supplying to an algorithm a set of tagged phrases (observations), from which an attribute grammar is inferred that can generate not only the training data but also other phrases similar to it. Although there is a substantial body of work on attribute grammars, as well as on the grammatical inference of regular and context free grammars, there is very little work on the grammatical inference of context free attribute grammars for natural language phrases [5]. Stolcke’s [2] Bayesian model merging (BMM) algorithm is an exception. Both BMM and our algorithm can be described as minimum description length (MDL) algorithms. Other examples of MDL algorithms include the algorithms of Cook [6] and Grünwald [7], although these algorithms infer context free grammars rather than attribute grammars. The minimum description length principle can be considered to be a rearrangement of Bayes' law. With grammatical inference the aim is to find a suitable model (M) to describe a set of observed phrases (X). If there are a number of candidate models to select from, then Bayes' law states that you would select the model that maximises the probability $P(M \mid X)$. Using Bayes' law,

$$P(M \mid X) \propto P(M)\,P(X \mid M) \qquad (1)$$

Maximising this function is equivalent to minimising

$$-\log P(M) - \log P(X \mid M) \qquad (2)$$

Information theory tells us that $-\log P(E)$ is the optimal code length for communicating an instance of E. Therefore $-\log P(M)$ can be seen as the optimal code length for describing the model (complexity), and $-\log P(X \mid M)$ can be seen as the optimal code length for describing the data using the model (discrepancy or cross entropy). The optimal solution may therefore be found by choosing the model that has the shortest total description length of both the model and the data [2]. In the absence of negative data the MDL principle can be used to indicate when generalisation should stop. This is because the MDL is a hypothetical mid-point between the most restrictive model, in which only the training data can be generated by the grammar, and the most general model, which can generate any phrase and therefore requires the training data to be described explicitly.
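The trade-off between complexity and discrepancy can be illustrated numerically. The probabilities below are hypothetical values chosen for the example, and `description_length` is a name introduced here; the function simply evaluates expression (2) with base-2 logarithms so that lengths are in bits.

```python
import math


def description_length(p_model, p_data_given_model):
    """Total code length in bits: -log2 P(M) - log2 P(X|M)."""
    return -math.log2(p_model) - math.log2(p_data_given_model)


# Hypothetical candidates: (prior P(M), likelihood P(X|M))
candidates = {
    "restrictive": (1 / 1024, 1 / 2),     # complex model, fits the data tightly
    "balanced":    (1 / 64,   1 / 8),
    "general":     (1 / 4,    1 / 4096),  # simple model, data costly to encode
}
lengths = {name: description_length(pm, px)
           for name, (pm, px) in candidates.items()}
best = min(lengths, key=lengths.get)
```

With these numbers the restrictive model costs 10 + 1 = 11 bits, the balanced model 6 + 3 = 9 bits and the general model 2 + 12 = 14 bits, so the model between the two extremes has the shortest total description.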

MDL grammatical inference algorithms typically use a greedy search whereby a starting grammar is improved one step at a time using a cost function that describes the total description length. Candidate grammars are then compared for their effect upon the cost function, and the model that creates the greatest reduction in the cost function is chosen. If none of the candidate grammars leads to a reduction in the cost function, generalisation stops.
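The greedy search just described has a simple generic shape, sketched below. The function `greedy_search` and its parameters are illustrative assumptions, and the toy example uses integers in place of grammars purely to show the stopping condition.

```python
def greedy_search(start, neighbours, cost):
    """Move to the cheapest neighbour until no neighbour lowers the cost."""
    current = start
    while True:
        candidates = neighbours(current)
        if not candidates:
            return current
        best = min(candidates, key=cost)
        if cost(best) >= cost(current):
            return current        # no reduction: generalisation stops
        current = best


# Toy stand-in: "grammars" are integers and the cost is minimal at 7.
result = greedy_search(0, lambda g: [g - 1, g + 1], lambda g: (g - 7) ** 2)
```

In a real MDL inference algorithm `neighbours` would apply the chunking and merging operators and `cost` would be the total description length; a greedy search of this kind can stop at a local minimum.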

In both Stolcke’s [2] algorithm and Cook’s [6] algorithm a starting grammar that generates the training data explicitly is created (i.e. each rule represents a number of identical observations). Progression from one candidate grammar to another is via a set of operators. These operators can be described as either chunking or merging. Chunking involves the identification of commonly repeated sequences (chunks). New rules are created that represent these chunks, and each new rule is substituted into the existing rules. This adds hierarchical structure to the grammar without increasing discrepancy, while commonly reducing the complexity of the grammar.

Merging involves the merging of two non-terminals based upon patterns. When two 
non-terminals are merged, the grammar is often generalised. 

Related research includes the work of Dulz et al. [8] and Ross [9]. Dulz infers attribute grammars to model the performance characteristics of protocol implementations, but his work differs in that he infers regular grammars, and attribute copy rules are used only to define average arrival times of protocol units.

Ross uses attribute grammars in conjunction with genetic programming to infer sto- 
chastic regular grammars. He uses attribute grammars to ensure the validity of his 
inferred regular grammars, and in doing so he creates a more concise implementation 
of his code. He does not use attribute grammars to attach meaning to parsed phrases. 



2 Building Applications Using the Lyrebird Tool 

The Lyrebird tool has been designed to develop spoken dialog systems where the tasks to be performed are well defined, and a spoken natural language interface is required. In order to construct the application, the developer defines the tasks to be performed as a series of operations, along with the parameters required to perform each task and their types. For instance, in a stock broking application the developer might define three tasks to perform, such as buying, selling and listing stock prices. Each of these operations has a set of slots that need to be filled. The "buy" operation might require a stockname, the number of stocks and the price of each stock. The types of the parameters can include predefined types, such as integers, money amounts and dates, or they can be defined as part of the application description as a list of items, such as stocknames, locations or products. Parameters are defined as either mandatory or optional. Mandatory parameters need to be specified explicitly by speakers, while optional parameters can be set to default values, or left unfilled. Applications that can be defined this way include voice commerce applications, bill paying, ordering, messaging and scheduling.
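An application description of this kind could be represented with a few simple records, as in the sketch below. The paper does not give the toolkit's actual description format, so the classes `Parameter` and `Operation`, the type names and the choice of `price` as an optional parameter with a default are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Parameter:
    name: str
    type: str                 # a predefined type or the name of an item list
    mandatory: bool = True
    default: object = None    # only meaningful for optional parameters


@dataclass
class Operation:
    name: str
    parameters: list = field(default_factory=list)

    def missing(self, filled):
        """Mandatory slots the speaker has not yet supplied."""
        return [p.name for p in self.parameters
                if p.mandatory and p.name not in filled]


item_lists = {"stockname": ["acme.com"]}   # hypothetical application-defined list
buy = Operation("buy", [
    Parameter("stockname", "stockname"),
    Parameter("number", "integer"),
    Parameter("price", "money", mandatory=False, default="market rate"),
])
todo = buy.missing({"stockname": "acme.com"})
```

A menu driven dialog manager can prompt for each name returned by `missing` in turn, which corresponds to the initial application the tool builds from the description.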

From the application description presented to it, the Lyrebird tool builds an initial 
finite state machine dialog manager, along with a set of prompts and grammars. This 
initial application is a menu driven dialog in which speakers are prompted for each 
slot to be filled. Likewise the grammars are simple predictable responses to these 
questions and the prompts follow a simple pattern. From this simple description, a 
more sophisticated mixed mode initiative dialog can be learnt by supplying example 
phrases to the Lyrebird tool. 

For instance the developer might supply the phrase:

"buy three hundred shares of acme dot com at the going rate"

This might be represented by the attributes:

{ operation=buy, stockname="acme.com", price="market rate" }

From this phrase the Lyrebird tool would be able to generalise the phrase to include 
other phrases describing the purchasing of different quantities of other stocks. With 
additional phrases, the Lyrebird tool will learn the equivalence of phrases such as "the 
going rate" and "market value". The Lyrebird algorithm also has the ability to learn 
prepositional phrases by identifying synonyms and complex phrase structures that 
describe structured types. It can also extend grammars that include integer types and 
concatenated strings. 

In addition to being able to generalise expressions, the Lyrebird tool typically does not 
require the developer to modify the dialog manager to accommodate the mixed mode 
initiative input. 

The approach taken by the Lyrebird tool is to generate all of the regular parts of the application automatically using a templating approach, and to learn the irregular parts of the application through example. This technique requires the collection of phrases from a trial service to improve the application. This process of collecting example phrases from a trial, however, is commonly performed when an application is built manually, so the grammatical inference technique results in a reduced development time.



2.1 The Grammatical Inference Algorithm 

The grammatical inference algorithm will now be briefly described. For a more detailed description of the algorithm the reader should refer to [10] and [11].

The Lyrebird algorithm infers an attribute grammar from a set of tagged training data, plus an optional starting grammar. It does this in the following manner:

1) An intermediate grammar is created that generates the training data (incorporation phase).

2) Hierarchical structure is then added to the intermediate grammar (chunking phase).

3) Generalisation then occurs through the merging of rules (merging phase).

4) Redundant rules are then removed by reestimating rule probabilities and deleting zero probability rules (reestimation phase).

The process is then repeated until the grammar is as compact as possible, while still attaching the correct meanings to all of the training data. The algorithm does not use a minimum description length cost function explicitly to determine when to stop inference. Instead it uses several inductive biases for adding hierarchical structure which have their basis in the MDL principle. The algorithm also uses an MDL checksum to prevent endless looping.

The algorithm can use negative examples, but can equally operate without them. The 
algorithm can detect overgeneralisation, by detecting when training phrases are as- 
signed two or more meanings, only one of which corresponds to the meaning attached 
to it in the training data. 

During the incorporation phase the grammar is extended to include previously unpre- 
dicted phrases. When a starting grammar is supplied to the algorithm each training 
observation is parsed using a bottom up chart parser. In the case of a partial parse, this 
creates a small number of parse-trees, which return attributes. When such a parse-tree 
exists, and the attributes it produces exist in the observation, the phrase can be re- 
placed by a reference to the non-terminal, and the copy rules updated. 

For instance, if there was a starting grammar that included rules for describing numbers and stocknames, and the observation "buy three hundred shares of acme dot com" was observed in the dialog state ".TopLevelStock" with the attributes "{ op=buy, stock="acme.com" number=300 }", a rule of the form given below might be added.

TopLevelStock -> buy Number:x1 of StockName:x2 { op=buy, stock=$x2.stock number=$x1.number }

When there is no starting grammar, a new rule is created that reproduces the observa- 
tion explicitly. 

During the chunking phase, three different techniques are used to attach hierarchical structure to the grammar, only two of which are described here for the sake of brevity.

The first method of generating hierarchical structure is to attach meanings to individual words in the grammar. These are found by determining correlations between the attributes of observed phrases and the words within those phrases. To enable this problem to be solved, the Lyrebird algorithm builds a class of attribute grammars where:

1. Each attribute has a type, and copy rules can only assign values of the appropriate type to an attribute.

2. Copy rules cannot refer to functions (i.e. they are all of the form x=$y.z or x=Z).

3. The result of a copy rule is visible on all observations it generates.

When these conditions hold true, each copy rule may contribute zero or more attributes or attribute values. For instance, in Figure 1 the rule

.S -> i'd like to fly S2:s { from=$s.from, to=$s.to, op=bookflight }

contributes the attribute op=bookflight to the phrase

i'd like to fly from melbourne to sydney

The rule

Location -> melbourne { location=melbourne }

contributes the value "melbourne" to it.

The contribution a rule makes to an observation can be described using an ordered pair 
notation as described in Table 1. 



Table 1. Attribute contribution notation

Notation    Name                     Meaning
(f,v)       Static contribution      Contributes the attribute f with value v
(*,v)       Wildcard contribution    Contributes the value v



Consider the scenario of two observed phrases as follows:

from melbourne to sydney { from=mel, to=syd }
from perth to melbourne { from=pth, to=mel }

The first phrase could only be generated by rules with the following attribute contributions:

A1 = { (from,mel), (to,syd), (*,mel), (*,syd) }

This is defined as the attribute contributions of the phrase, and includes one static contribution and one wildcard contribution for each attribute. Similarly, the attribute contributions of the second phrase are

A2 = { (from,pth), (*,pth), (to,mel), (*,mel) }

A list of possible attribute contributions of a rule being considered for chunking can be created from the intersection of the attribute contributions of all the phrases that would be generated by it.

For instance, if we considered a rule of the form

X -> melbourne

then, given the two phrases above, its attribute contributions would be contained in the set

A3 = A1 ∩ A2 = { (*,mel) }

This reduces the search space by eliminating impossible scenarios. For instance, the new rule could not contribute (*,syd), otherwise the second phrase would contain this attribute contribution also.
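The intersection above can be reproduced with ordinary set operations. The sketch below is an illustration only: the function names are invented here, and phrases are matched by simple word membership rather than by parsing.

```python
def contributions(attributes):
    """Static (f, v) and wildcard ('*', v) contributions for one tagged phrase."""
    result = set()
    for f, v in attributes.items():
        result.add((f, v))
        result.add(("*", v))
    return result


def candidate_contributions(word, observations):
    """Intersect the contributions of every observed phrase containing `word`."""
    sets = [contributions(attrs) for phrase, attrs in observations
            if word in phrase.split()]
    return set.intersection(*sets) if sets else set()


observations = [
    ("from melbourne to sydney", {"from": "mel", "to": "syd"}),
    ("from perth to melbourne", {"from": "pth", "to": "mel"}),
]
a3 = candidate_contributions("melbourne", observations)
```

Here `a3` is the set containing only the wildcard contribution for "mel", matching A3. With only these two phrases the word "from" yields the same set, which shows how sparse training data can produce correlations that more data would remove.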

A copy rule can then be generated from this attribute contribution, either from=mel or to=mel. After chunking this would give the grammar of Figure 2 below.

.S -> from X:x to sydney { from=$x.from, to=syd }

.S -> from perth to X:x { from=pth, to=$x.from }

X -> melbourne { from=mel }

Fig. 2. Grammar after chunking

It should be noted that, with sparse training data, correlations between phrases and attribute contributions can occur that would disappear with more data. For instance, in the example above, the contribution (*,mel) is also attached to the words “from” and “to”. In the grammar shown in Figure 2, however, the contribution (*,mel) can only be attached to one rule.

The second way in which hierarchical structure is added to the grammar is by replacing repeated phrases with a reference to a new rule which contains the repeated phrase. This enables phrases of two or more symbols to be assigned a consistent meaning throughout the grammar. The creation of these new rules is a form of compression, and the technique used in the Lyrebird algorithm borrows from the Sequitur algorithm [12]. The Lyrebird algorithm uses a bigram table to count the number of occurrences of two consecutive symbols in the grammar, to ensure that the most commonly occurring sequences of symbols are chunked first. Chunking the most commonly occurring sequence of symbols first gives the greatest reduction in the description length of the grammar without affecting discrepancy.
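A bigram table and a single chunking step can be sketched as follows. This is a simplified illustration, not the Lyrebird implementation: attributes are ignored, each rule label maps to a single right-hand side, and the names `most_common_bigram`, `chunk` and `C1` are assumptions introduced here.

```python
from collections import Counter


def most_common_bigram(rules):
    """Return the most frequent pair of adjacent symbols and its count."""
    counts = Counter()
    for rhs in rules.values():
        counts.update(zip(rhs, rhs[1:]))
    return counts.most_common(1)[0] if counts else None


def chunk(rules, pair, new_symbol):
    """Replace each occurrence of `pair` with `new_symbol` and add its rule."""
    chunked = {}
    for lhs, rhs in rules.items():
        out, i = [], 0
        while i < len(rhs):
            if tuple(rhs[i:i + 2]) == pair:
                out.append(new_symbol)
                i += 2
            else:
                out.append(rhs[i])
                i += 1
        chunked[lhs] = out
    chunked[new_symbol] = list(pair)
    return chunked


rules = {
    "R1": ["i'd", "like", "to", "fly"],
    "R2": ["i'd", "like", "a", "ticket"],
    "R3": ["i'd", "like", "to", "go"],
}
pair, count = most_common_bigram(rules)
chunked = chunk(rules, pair, "C1")
```

The pair ("i'd", "like") occurs three times, so it is chunked first; the grammar shrinks from twelve right-hand-side symbols to eleven while still generating exactly the same phrases.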

During the merging phase, symbols are merged that can be considered to be equiva- 
lent. Merging reduces the complexity of a grammar and generalises it so that it can 
handle additional phrases. The merging phase uses a set of evidence patterns to deter- 
mine which non-terminals should be merged. An example evidence pattern along with 
its required merger action is shown in Table 2 below. 



Table 2. Example Merging Evidence Pattern

Evidence                  Action
X -> A B & X -> A C       Merge B and C: X -> A Y, Y -> B, Y -> C



If there is evidence for the merge, the merge is executed. Prior to the completion of 
merge, tests are applied to determine the suitability of the merge. The most critical test 
is that the grammar can still generate all of the training examples, and attach the same 
meaning to them. 
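The evidence pattern of Table 2 can be detected and applied as in the sketch below. It is an assumption-based illustration: rules are plain (left-hand side, right-hand side) pairs with non-empty right-hand sides, attributes are ignored, and the suitability tests described above (in particular, checking that every training example is still generated with the same meaning) are deliberately left out.

```python
from collections import defaultdict


def find_merge_evidence(rules):
    """Find rules X -> A B and X -> A C that differ only in the last symbol."""
    by_context = defaultdict(list)
    for lhs, rhs in rules:
        by_context[(lhs, tuple(rhs[:-1]))].append(rhs[-1])
    return {ctx: sorted(set(lasts)) for ctx, lasts in by_context.items()
            if len(set(lasts)) > 1}


def apply_merge(rules, context, symbols, new_symbol):
    """Replace the evidence rules with X -> A Y and add Y -> B, Y -> C."""
    lhs, prefix = context
    kept = [(l, r) for l, r in rules
            if not (l == lhs and tuple(r[:-1]) == prefix and r[-1] in symbols)]
    kept.append((lhs, list(prefix) + [new_symbol]))
    kept.extend((new_symbol, [s]) for s in symbols)
    return kept


rules = [("X", ["A", "B"]), ("X", ["A", "C"]), ("Z", ["D"])]
evidence = find_merge_evidence(rules)
merged = apply_merge(rules, ("X", ("A",)), evidence[("X", ("A",))], "Y")
```

After the merge, any other rule that refers to Y accepts both B and C, which is how merging generalises the grammar beyond the training data.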

After merging, the algorithm removes redundant and less desirable rules by attaching probabilities to rules and deleting zero probability rules. Probabilities are estimated using empirical relative frequency estimates [13] and expectation maximisation using the Viterbi parse [14].



3 Experimental Results 

Our algorithm was first tested on grammars used by Stolcke and included in the BOOGIE [15] software. The Lyrebird algorithm consistently outperforms the BMM algorithm on these grammars and on similar grammars. Figure 3 below shows the grammar coverage of the two algorithms on a natural language date grammar, where the difference is most noticeable. The training set included expressions such as

march the twenty fifth of two thousand and five { month="march" day=25 year=2005 }

and

next sunday the twenty fifth { day=25 day_of_week="Sunday" modifier="next" }

The grammar contained 84 rules and could generate 197,150 possible phrases.

The grammar was chosen because the correlation between some words and the attributes attached to them is inconsistent. For instance:

DayOrd -> second { day=2} 

DayOrd -> twenty second { day=22} 

A number of sample and test sets were generated from the grammars, such that the 
first hundred samples in the training sets included observations generated by each rule 
in the target grammar. Both tools were tested with training sets of various sizes. A 
phrase was considered to be in grammar if the grammar could generate it, and the 
attributes attached to the phrase in the training set were the same as those attached to it 
by the highest probability parse. 

Figure 3 shows the results of these tests. The lines labelled "grammar coverage" 
indicate the percentage of phrases generated by the target grammar that were parsed 
with the correct meaning by the inferred grammar. The lines labelled "reverse grammar 
coverage" show the percentage of phrases generated by the inferred grammar that are 
parsed with the correct meaning by the target grammar. These two measurements are 
equivalent to the concepts of recall and precision in information retrieval, respectively. 
Of these two measures, grammar coverage (recall) is the more important, and speech 
recognition accuracy will more closely track it, as demonstrated in Figure 4. 
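
The two measures can be sketched as a single function applied in both directions. This is a minimal illustration only: each grammar is stood in for by a hypothetical lookup table from phrase to attributes, whereas the experiments use the highest probability parse of a real grammar. The name `coverage` and the toy data are assumptions made for the example.

```python
def coverage(samples, parse):
    """Percentage of (phrase, attributes) samples whose attributes `parse` reproduces.

    `parse` returns the attributes of the best parse, or None if the phrase
    is not in the grammar.
    """
    samples = list(samples)
    if not samples:
        return 0.0
    hits = sum(1 for phrase, attrs in samples if parse(phrase) == attrs)
    return 100.0 * hits / len(samples)


# Toy stand-ins for the target and inferred grammars (hypothetical data).
target = {"second": {"day": 2}, "twenty second": {"day": 22}}
inferred = {
    "second": {"day": 2},
    "twenty second": {"day": 2},  # wrong meaning
    "third": {"day": 3},          # not generated by the target grammar
}

# Grammar coverage (recall): target phrases parsed correctly by the inferred grammar.
recall = coverage(target.items(), inferred.get)
# Reverse grammar coverage (precision): inferred phrases parsed correctly by the target.
precision = coverage(inferred.items(), target.get)
```

Here a phrase counts as covered only when it is in grammar and carries the same attributes, which mirrors the in-grammar criterion described above.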

Each point on figure 3 is the average of four tests. It can be seen that the Lyrebird 
algorithm significantly outperforms the Bayesian Model Merging algorithm for this 
data. With 700 training samples, the Lyrebird algorithm could produce 99.9% of the 







phrases in the test sets on average. On some training sets the algorithm could generate 
a grammar with 100% coverage with as few as 500 training samples. 

The reverse grammar coverage plotted in figure 3 shows that the inferred grammar is 
slightly more general than the target grammar. All of the overgeneralised phrases 
inferred by Lyrebird were the result of the inference of the equivalence of "two thou- 
sand" and "twenty". This was due to the inclusion in the training data of phrases such 
as 

april the ninth of twenty eleven 
{month="april" day=9 year=2011} 

The grammars inferred by Bayesian Model Merging differed significantly from the 
target grammar, and included large amounts of recursion. 



Figure 3. Lyrebird vs Bayesian Model Merging 

[Plot omitted. Series: Lyrebird grammar coverage (recall); BMM grammar coverage 
(recall); Lyrebird reverse grammar coverage (precision); BMM reverse grammar coverage 
(precision). Horizontal axis: Test Sample.] 



The performance of the Lyrebird algorithm was then compared against that of an 
experienced human developer. Two identical train timetable applications were built, 
one manually and one using the Lyrebird tool. The two applications were then compared on: 

1) The time taken to develop the application. 

2) The quality of the application developed. 

The time taken to develop the application for the start of the learning curve test is 
shown in Table 3. Application quality was measured using the percentage of utterances in 
grammar (recall) and natural language speech recognition accuracy. Precision cannot 
be measured with real-world natural language phrases because the target grammar is 
unknown. Twelve different speakers were given twelve tasks each (144 tasks in total). They 
then interacted with one of the prototype systems to get the information they wanted. 
After they completed these tasks they were encouraged to "try to say it all in one 
sentence" and to "say it the way you (they) would like the system to understand". This 
resulted in the collection of 1074 utterances, which were transcribed into text and 
divided into four sets. Both the manually crafted and automatically learnt grammars 
were then tested on one of the four test sets. The grammars were then modified using 
the previously seen utterances and tested on another utterance set. This was repeated 
four times, including the initial grammar. A learning curve was then plotted to show 
the predictive power of both the manually developed and automatically acquired 
grammars. This learning curve can be seen in Figure 4. 



Figure 4. Learning Curve, Lyrebird vs Human 

[Plot omitted. Series: Grammar Coverage (Lyrebird); Grammar Coverage (Human); 
Recognition Accuracy (Lyrebird); Recognition Accuracy (Human).] 



Table 3. Development time, Lyrebird vs manual (starting application). 

                                   Lyrebird (hrs)   Manual (hrs) 
Specification & Data Collection         0.47            0.47 
Dialog Manager and Prompts              2.00           21.5 
Grammar construction                    0.75            9 
Total (hrs)                             3.22           30.97 



4 Conclusion 



This paper describes a grammatical inference algorithm created for use in a toolkit for 
the development of spoken dialogue systems, and compares its performance to both 
another algorithm for inferring attribute grammars and an experienced human developer. 

The comparison of the algorithm with Bayesian Model Merging illustrates both its 
improved performance and its ability to infer grammars that tightly define the training 
data. The development of the attribute grammars for an application is the most critical 
and time-consuming task. In addition, the work is not easily distributed amongst more 
than one developer. 

The comparison of the algorithm against a human developer illustrates the algorithm's 
ability to learn natural spoken language in a shorter time frame than a human, with 
similar quality. 



References 

1. Knuth, Donald E., 1968, "Semantics of context-free languages." Mathematical Systems 
Theory, 2(2): pp 127-45. 

2. Stolcke, A. and Omohundro, S., 1994, Inducing probabilistic grammars by Bayesian model 
merging. Grammatical Inference and Applications. Second International Colloquium, 
ICGI-94, pp 106-18. Berlin: Springer Verlag. 

3. Stolcke, Andreas, 1994, Bayesian Learning of Probabilistic Language Models. Berkeley 
CA: University of California dissertation. 

4. Fischer, Charles N. and LeBlanc, Richard J. Jr., 1988, Crafting a Compiler. Menlo Park 
CA: Benjamin/Cummings. 

5. Vidal, Enrique, 1994, Grammatical Inference: An introductory survey. Grammatical 
Inference and Applications. Second International Colloquium, ICGI-94, pp 1-4. Berlin: 
Springer Verlag. 

6. Cook, Craig M., Rosenfeld, Azriel and Aronson, Alan R., 1976, Grammatical inference by 
Hill Climbing. Information Sciences 10, pp 59-80. New York: North-Holland. 

7. Grunwald, Peter, 1994, A minimum description length approach to grammar inference. 
Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language 
Processing, volume 1004 of Lecture Notes in AI, pp 203-16. Berlin: Springer Verlag. 

8. Dulz, Winfried and Hoffmann, Stefan, 1991, Grammar-based Workload Modelling of 
Communication Systems. Computer Performance Evaluation. Modelling Techniques and 
Tools. Proceedings of the Fifth International Conference, pp 17-31. Amsterdam: 
North-Holland. 

9. Ross, Brian J., 2001, Logic-based Genetic Programming with Definite Clause Translation 
grammars. New Generation Computing (in press). 

10. Starkie, Bradford C., 1999, A method of developing an interactive system. International 
Patent WO 00/78022. 

11. Starkie, Bradford C., 2001, Developing Spoken Dialog Systems using Grammatical 
Inference. Proceedings of the 2001 Australasian Natural Language Processing Workshop 
(ANLP 2001), pp 25-32. Macquarie University Language Technology Group. 

12. Nevill-Manning, Craig G., 1996, Inferring Sequential Structure. University of Waikato 
doctoral dissertation. 

13. Abney, Steven, 1997, Stochastic Attribute-Value Grammars. Computational Linguistics, 
vol. 23, no. 4, pp 597-618. MIT Press for Assoc. Comput. Linguistics. 

14. Charniak, Eugene, 1993, Statistical Language Learning. Cambridge, Mass.: MIT Press. 

15. Stolcke, Andreas, June 1994, How to BOOGIE: A manual for Bayesian Object-oriented 
Grammar Induction and Estimation. Internal memo, International Computer Science 
Institute. 




Towards Genetic Programming for 
Texture Classification 



Andy Song, Thomas Loveard, and Vic Ciesielski 



School of Computer Science 
RMIT University 

GPO Box 2476V, Melbourne Victoria 3001, Australia 
{asong, toml, vc}@cs.rmit.edu.au 



Abstract. The genetic programming (GP) method is proposed as a 
new approach to perform texture classification based directly on raw 
pixel data. Two alternative genetic programming representations are 
used to perform classification. These are dynamic range selection (DRS) 
and static range selection (SRS). This preliminary study uses four 
Brodatz textures to investigate the applicability of the genetic 
programming method for binary texture classifications and multi-texture 
classifications. 

Results indicate that the genetic programming method, based directly on 
raw pixel data, is able to accurately classify different textures. The results 
show that the DRS method is well suited to the task of texture 
classification. The classifiers generated in our experiments by DRS have good 
performance over a variety of texture data and offer GP as a promising 
alternative approach for the difficult problem of texture classification. 



1 Introduction 

Textural information is an important aspect of visual processing in both human 
and artificial applications. The capability of recognising and categorising textures 
is a key requirement to the use of textural data in a visual system, be it natural 
or artificial. 

Genetic programming (GP) has emerged as a flexible method of problem 
solving for a diverse range of complex problems [6]. In this approach a population 
of computer programs is evolved over a number of generations to perform a 
specific task, in a process analogous to natural evolution. 

One task that GPs have previously been applied to is that of classification. 
In classification tasks, each example to be classified consists of a feature vector 
pertaining to various attributes of the example, and a class label, indicating 
to which of a set of possible classes the example belongs. A classifier is trained 
from a given set of examples where the class of the example is known, and 
the accuracy of the classifier can then be determined by the application of the 
resulting classifier to unseen data. In previous investigations GPs have been shown 
to be capable of producing accurate classifiers in a variety of domains such as 
medical diagnosis [7,8], object detection and image analysis [11,13]. 



M. Brooks, D. Corbett, and M. Stumptner (Eds.): AI 2001, LNAI 2256, pp. 461-472, 2001. 

Texture is observed as homogeneous visual patterns of scenes that we 
perceive, such as grass, cloud, wood and sand. The repetition of such patterns 
somehow produces the uniformity of sense which is very important for an observer to 
understand the scenes. 

The textural property of an image is one of the most informative cues for 
machine vision tasks, such as visual inspection, medical image analysis and re- 
mote sensing. However, there is no universally accepted definition of texture and 
no universal model to describe texture. Analysing texture information of images 
still remains a complex problem. 

Texture classification is one of the main areas of texture analysis problems, 
in which image textures are categorized into different classes and an observed 
image will be determined to belong to one of the given set of texture classes. 
The conventional method of texture classification involves obtaining a priori 
knowledge of each class to be recognized. Normally this knowledge is some set of 
texture features of one or all of the classes. Once the knowledge is available and 
texture features of the observed image are extracted, classical classification 
techniques, e.g. nearest neighbors and decision trees, can be used to make the 
decision [12]. The fields that texture classification has been applied to include 
the classification of satellite images [10], radar imagery [1], inspection [3] and 
content based image retrieval [5]. 

In this paper we propose a new method for texture classification problems 
that uses genetic programming in a one step process, directly based on 
raw pixel data. As a result the process of classifier production by this method 
does not require manual intervention or detailed domain knowledge. 

The aim of this research investigation is to explore the application of GP 
towards the texture classification problem, and to determine a GP methodology 
that can be appropriately applied to this complex domain. Additionally this work 
aims to provide support for the suitability of the GP approach to the texture 
domain and for applicability of the method in general. 

2 Methodologies 

2.1 Genetic Programming: Static Range Selection 

In the general case, genetic programs return numeric (real) values as program 
output, as can be seen in the example program in Figure 1. In previous 
work on GP for classification the real values returned by GPs have been 
interpreted as class values by arbitrarily segmenting the range of reals, with each 
segment corresponding to a class label. For this investigation we term this method 
of classification static range selection (SRS). 

For a two class (binary) classification problem the division point for the 
classes in SRS is generally made to be the zero point, which forms a suitable 
decision boundary for the two classes. A training example that results in a 
negative output from the program will be classified as class 1, while a non-negative 
result will be classified as class 2. For problems with more than two classes, 
meaningful division points over the set of reals are not readily available, and the 
choice of arbitrary division points has been shown to produce less accurate 
classifiers [7]. For this reason the SRS method is seen to be suitable only for 
binary classification problems. 
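
The binary SRS rule described above can be sketched in a few lines. This is only an illustration: `example_program` is a hypothetical stand-in for an evolved program, not one produced in the experiments.

```python
def srs_classify(program_output):
    """Static range selection for two classes, with zero as the division point.

    A negative output is class 1; a non-negative output is class 2.
    """
    return 1 if program_output < 0 else 2


def example_program(attributes):
    """Hypothetical evolved program: the difference of two pixel attributes."""
    return attributes[0] - attributes[1]


# Classify one example whose first two attributes are 10 and 200.
label = srs_classify(example_program([10.0, 200.0]))
```

Because the division point is fixed at zero for every program, evolution has to shape the program output around that boundary, which is the restriction that DRS later relaxes.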



[Figure omitted. Label: Numeric Output.] 

Fig. 1. An example genetic program for classification 



2.2 Genetic Programming: Dynamic Range Selection 

An alternative to SRS for performing classification using genetic programming is 
the method of dynamic range selection (DRS) [7]. As in the SRS approach 
to classification, real values are returned by a GP which must then be interpreted 
as one of a possible set of class labels. In this instance, however, the ranges for 
class labels over the set of reals are determined dynamically, and can be different 
for each program. The segmentation of the range of reals is performed using a 
subset of the training data termed the segmentation subset. For any individual 
within the GP population, each data example of the segmentation subset is 
presented to the GP classifier and the output value is recorded, along with the 
known class label. In the interests of limiting complexity, in this investigation 
the output values are rounded to the nearest integer value, and output values 
greater than +250 are considered to be +250. Likewise, values less than -250 are 
considered to be -250. 

Once all elements from the segmentation subset have been presented to the 
GP, the output range [-250,250] is segmented into class labels. This segmentation 
is performed such that any given point over the range [-250,250] corresponds 
to the class label of the nearest output value obtained from the segmentation 
subset. The resulting range segmentation may thus contain multiple class labels, 
and multiple regions for any given class label. 

The DRS method of classifier representation was shown to be capable of 
producing accurate results over a variety of datasets in the medical and image 
processing domains [7]. 
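
The segmentation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, ties between two equally near recorded outputs are resolved in favour of the one recorded first, and the handling of exact half values in rounding follows Python's built-in `round`. None of these details is specified in the text.

```python
def clamp_round(value, limit=250):
    """Clamp a program output to [-limit, limit], then round to the nearest integer."""
    return round(max(-limit, min(limit, value)))


def build_segmentation(outputs, labels, limit=250):
    """Map every integer in [-limit, limit] to the label of the nearest recorded output.

    `outputs` are program outputs on the segmentation subset and `labels` are
    the known class labels of those examples.
    """
    points = [(clamp_round(o, limit), label) for o, label in zip(outputs, labels)]
    segmentation = {}
    for x in range(-limit, limit + 1):
        # min() keeps the first of several equally near points (tie-break assumption).
        _, nearest_label = min(points, key=lambda p: abs(p[0] - x))
        segmentation[x] = nearest_label
    return segmentation


def drs_classify(output, segmentation, limit=250):
    """Classify a new program output using a previously built segmentation."""
    return segmentation[clamp_round(output, limit)]


# Hypothetical outputs of one program on a four-example segmentation subset.
seg = build_segmentation(
    [-300.2, -220.0, -200.0, 10.4], ["other", "other", "D112", "D112"]
)
```

With these four recorded outputs the range splits into a single "other" region at the low end and a single "D112" region above it, but nothing in the procedure prevents a class from owning several separate regions.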



2.3 Parameters 

For comparison purposes, all experiments are run under the same conditions. 
All runs use a population size of 200 individuals. Each run terminates either 
when perfect classification on the training set is reached (a training accuracy 
of 100%) or after 200 generations have been processed. When moving to a 
new generation, roulette wheel selection is applied. The crossover operation 
accounts for 90% of the breeding pool and elitist reproduction is used for the 
remaining 10%. The mutation operator is not used. The programs are generated 
with an initial maximum depth of 6, with overall program depth limited to 17. 
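
As an illustration of these settings, the following sketch shows a generic fitness-proportionate (roulette wheel) selection routine and the split of a 200-individual population into crossover and elitist shares. It is a textbook formulation, not the code used in the experiments, and the names `roulette_select` and `breeding_plan` are hypothetical.

```python
import random


def roulette_select(population, fitnesses, rng=random):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        # Degenerate case: no fitness information, so choose uniformly.
        return rng.choice(population)
    pick = rng.random() * total
    running = 0.0
    for individual, fitness in zip(population, fitnesses):
        running += fitness
        if pick < running:
            return individual
    return population[-1]  # guard against floating point round-off


def breeding_plan(population_size=200, crossover_rate=0.9):
    """Split the next generation into crossover offspring and elitist copies."""
    by_crossover = int(round(population_size * crossover_rate))
    return by_crossover, population_size - by_crossover


# With the parameters above: 180 individuals by crossover, 20 by elitism.
n_crossover, n_elite = breeding_plan()
```

A seeded `random.Random` instance can be passed as `rng` to make a run repeatable.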

For both DRS and SRS classifiers strongly typed genetic programming [9] is 
used to develop classifiers. The function and terminal set for the runs can be 
seen in Table 1 and Table 2 respectively (both the function and terminal set 
remain the same for the DRS and SRS methods). 



Table 1. Function Set 

Name     Return Type  Argument Types           Description 
Plus     Double       Double, Double           Arithmetic addition 
Minus    Double       Double, Double           Arithmetic subtraction 
Mult     Double       Double, Double           Arithmetic multiplication 
Div      Double       Double, Double           Protected arithmetic division 
                                               (divide by zero returns zero) 
IF       Double       Boolean, Double, Double  Conditional. If arg1 is true return 
                                               arg2, otherwise return arg3 
<=       Boolean      Double, Double           True if arg1 is <= arg2 
>=       Boolean      Double, Double           True if arg1 is >= arg2 
=        Boolean      Double, Double           True if arg1 is equal to arg2 
Between  Boolean      Double, Double, Double   True if the value of arg1 is between 
                                               arg2 and arg3 
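
A few of the less obvious entries in the function set can be sketched directly. The Python names are hypothetical, and whether Between includes its end points is not stated in the table, so the inclusive comparison below is an assumption.

```python
def protected_div(a, b):
    """Protected division: dividing by zero returns zero instead of failing."""
    return 0.0 if b == 0 else a / b


def if_then_else(condition, a, b):
    """IF: return a when the Boolean condition is true, otherwise b."""
    return a if condition else b


def between(x, low, high):
    """Between: true if x lies between low and high (inclusive bounds assumed)."""
    return low <= x <= high


# Strong typing: the Boolean result of `between` may only feed the first
# argument of IF, while the arithmetic functions consume and return doubles.
value = if_then_else(between(0.5, 0.0, 1.0), protected_div(6.0, 3.0), 0.0)
```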



Table 2. Terminal Set 

Name           Return Type  Description 
Random(-1, 1)  Double       Randomly assigned constant between -1 and 1 
Attribute[x]   Double       Value of attribute x 

2.4 The Datasets 




Fig. 2. The four textures: (a) French Canvas (D21), (b) Soap Bubbles (D73), (c) Cotton 
Canvas (D77), (d) Plastic Bubbles (D112) (from left to right) 



Four textures from the Brodatz album [2] are used to test the classification 
performance of the genetic programming method. They are French canvas (D21), 
Soap bubbles (D73), Cotton canvas (D77) and Plastic bubbles (D112). All these 
textures are grey level images, with values from 0 to 255 (see Figure 2). 

For each class of texture, 400 distinct sub-images are sampled from the 
original 640 x 640 picture. These sub-images can be considered as the texture elements, 
called texels [4]. Determining the size or appropriate resolution of texels could 
be an extensive research topic in itself. In our work, we simply use 16 pixels by 16 
pixels as the sampling size. 
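
The sampling step can be sketched as below. The text does not say how the 400 sub-images are positioned, so this sketch assumes they are drawn without replacement from the 1600 non-overlapping 16 x 16 tiles of the picture; the function name, the fixed seed and the synthetic image are likewise assumptions. Each texel is flattened into 256 grey level attributes.

```python
import random


def sample_texels(image, n_texels=400, size=16, seed=0):
    """Sample distinct size x size sub-images and flatten each to a list of pixels.

    `image` is a list of rows of grey levels (0 to 255). Tiles are taken from a
    non-overlapping grid, which is an assumption of this sketch.
    """
    height, width = len(image), len(image[0])
    corners = [
        (row, col)
        for row in range(0, height - size + 1, size)
        for col in range(0, width - size + 1, size)
    ]
    chosen = random.Random(seed).sample(corners, n_texels)
    return [
        [image[row + dr][col + dc] for dr in range(size) for dc in range(size)]
        for row, col in chosen
    ]


# Synthetic 640 x 640 grey level image standing in for a Brodatz texture.
image = [[(7 * r + 13 * c) % 256 for c in range(640)] for r in range(640)]
texels = sample_texels(image)
```

Each flattened texel then supplies the values read by the Attribute[x] terminals, with x running from 0 to 255.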

2.5 Fitness Function 

The fitness value measures the performance of the generated program in terms 
of the ability to solve the problem. The fitter individual programs have a prob- 
abilistically greater opportunity to pass genetic material to future generations. 

The fitness measure for texture classification is straightforward. The perfor- 
mance can be determined by the success made by the program, which is, in this 
case, the classification accuracy (the percentage of the cases that have been cor- 
rectly classified). As a result the fitness value can be expressed by the following 
formula: 



$$ f = \frac{TP + TN}{TOTAL} \times 100\% \qquad (1) $$ 

where TP is the number of true positives, TN is the number of true negatives and 
TOTAL is the total number of cases. 
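
As a worked illustration of formula (1) for a binary problem, the following sketch counts correctly classified positive and negative cases and returns the percentage. The function name and the choice of which class is treated as positive are assumptions of the example.

```python
def fitness(predicted, actual, positive):
    """Classification accuracy in percent: (TP + TN) / TOTAL x 100."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == positive and a == positive)
    tn = sum(1 for p, a in pairs if p != positive and a != positive)
    return 100.0 * (tp + tn) / len(pairs)


# Four cases: one true positive, two true negatives, one error.
score = fitness(predicted=[1, 1, 2, 2], actual=[1, 2, 2, 2], positive=1)
```

In this example three of the four cases are classified correctly, giving a fitness of 75%.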



2.6 Classification Accuracy 

To evaluate the classification accuracy of the methods trialed, ten-fold cross 
validation is applied. In experiments using SRS, nine folds of data are used in 
training and one is used as test (unseen) data. In experiments using DRS, one 
fold is used as the segmentation subset, eight folds are used for training and one 
fold is used for test. 

Due to the probabilistic nature of GP systems, results can vary from run to 
run. To reduce the variation of results due to random sampling, all experiments 
are run ten times and the final accuracy for training and test are given as an 
average of these ten runs (giving a total of 100 independent GP runs for each 
experiment). 
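
The allocation of folds for the two methods can be sketched as follows. Which fold serves as the DRS segmentation subset is not stated in the text, so taking the fold that follows the test fold is purely an assumption, as is the function name.

```python
def fold_roles(test_fold, method, n_folds=10):
    """Assign each of the n_folds folds a role for one cross validation step."""
    roles = {"test": [test_fold]}
    remaining = [f for f in range(n_folds) if f != test_fold]
    if method == "DRS":
        # Assumption: the fold after the test fold is the segmentation subset.
        segmentation_fold = (test_fold + 1) % n_folds
        roles["segmentation"] = [segmentation_fold]
        remaining = [f for f in remaining if f != segmentation_fold]
    roles["train"] = remaining
    return roles


srs_roles = fold_roles(0, "SRS")   # nine training folds, one test fold
drs_roles = fold_roles(9, "DRS")   # eight training, one segmentation, one test
```

Repeating this for each of the ten test folds, and repeating the whole procedure ten times, gives the 100 GP runs per experiment mentioned above.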

3 Experiments and Results 

3.1 Binary Classifications 

The first stage of this investigation involved the use of a GP classifier to 
differentiate one texture from another. Each run selected two of the four textures 
until all possible combinations were exhausted. The resulting dataset for 
each run consisted of 800 examples (400 examples from each texture class). The 
results of these experiments are shown in Table 3. 



Table 3. Classifying Two Textures: Classifier Accuracy (Test Data), average of 10 runs 

      D21 vs   D21 vs   D21 vs   D73 vs   D73 vs   D77 vs 
      D73      D77      D112     D77      D112     D112 
SRS   00       80.74%   87.72%   71.73%   87.86%   80.72% 
DRS   91.73%   89.59%   98.24%   84.35%   99.75%   99.11% 



3.2 Multiclass Classifications 

Binary problems are rare in the domain of texture classification and to account 
for this we extended the GP approach to multiclass classification. 

Five additional experiments, shown in Table 4, used all 1600 samples from 
the four textures. The first four experiments were an extension of binary 
classification, in which one texture was labelled as class 0 while all other textures 
were labelled as class 1, resulting in a binary classification problem. In contrast, the 
texture samples used in the fifth experiment were labelled from 0 to 3, so that 
four classes were included. 
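
The two labelling schemes can be sketched as below. The helper names are hypothetical, and the order in which the four textures receive the labels 0 to 3 in the fifth experiment is an assumption, since the text does not give it.

```python
TEXTURES = ["D21", "D73", "D77", "D112"]


def one_vs_rest_labels(sample_textures, target):
    """Label the target texture as class 0 and every other texture as class 1."""
    return [0 if name == target else 1 for name in sample_textures]


def four_class_labels(sample_textures):
    """Label samples from 0 to 3 following the order of TEXTURES (assumed order)."""
    return [TEXTURES.index(name) for name in sample_textures]


samples = ["D21", "D112", "D73", "D112", "D77"]
binary = one_vs_rest_labels(samples, target="D112")
multi = four_class_labels(samples)
```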

Both methods, SRS and DRS, were applied in the first four experiments. In the 
fifth experiment only DRS was used, as SRS is considered to be unsuited to 
multi-class classification. 

For the experiment classifying the texture D112 against the other three 
textures we also present, in Figure 3, the progression of classification accuracy over 
generations for the DRS and SRS methods. The accuracy figures are given for 
the best individual (based on training fitness) of each generation of the run. 
Figures for both training and test accuracy for such individuals are presented, 
although it should be noted that these figures are the average over 10 
runs. 

Table 4. Classifying Four Textures: Classifier Accuracy (Test Data), average of 10 
runs 

      D21 vs         D73 vs         D77 vs         D112 vs       All Four 
      D73 D77 D112   D21 D77 D112   D21 D73 D112   D21 D73 D77   Classes 
SRS   90.0%          77.4%          79.7%          85.9%         - 
DRS   91.8%          85.1%          81.7%          99.36%        71.82% 

[Plot omitted. Series: SRS Training, SRS Test, DRS Training, DRS Test. 
Horizontal axis: Generation.] 

Fig. 3. Accuracy of Best Individuals over Generations: D112 against D21 D73 D77 

Additionally, for this experiment (D112 vs D21, D73, D77), Figure 4 shows 
the trend of test accuracy over the ten runs. The top line represents the highest 
test accuracy of the ten runs, the bottom line shows the lowest, and the middle 
line represents the mean. 



4 Discussion and Analysis of Results 

In the application of the two methods of genetic programming to the texture 
classification task it can be seen that classifiers of a high degree of accuracy were 
produced, particularly when the DRS method of training was employed. The 
results shown in Figure 3 indicate that the DRS method not only had a higher 
accuracy than SRS, but also converged at a faster rate than the SRS method. It 
can be seen that in this experiment the average classification accuracy of the DRS 
method reached 99% in 40 generations. Subsequently, from generation 40 to 200, 





(+ (+ -0.771955 (* (/ -0.0752668 -0.771955) (+(+(+(/ -0.0752668 
-0.820847) (+ ATTR39 (- ( + ATTR91 ( - ATTR140 (/ ( + (if ( >= -0.552571 
-0.704058) (+ ATTR215 0.0654748) (- ATTR35 ATTR0)) ATTR84 ) ATTR247 ) 
) ) ( - ( - 0.624757 0.140204) ATTR201 )))) ATTR167 ) (+ ATTR39 (+ ATTR8 
ATTR30 ))))) ( ♦ ( / -0.0752668 -0.820847) (+(-(+(+ ATTR42 ATTR167) 
ATTR59) (- (- 0.624757 0.140204) ATTR201)) ATTR85))) 

Ranges: Class 1: D112 : -213 ~ +250 
Class 2: Other Textures : -250 ~ -214 



Fig. 5. A program generated by the DRS method which can achieve 100% accuracy 



the accuracy was only very slightly improved. In the ten runs of the experiment, 
there were four runs which were able to achieve 100% accuracy on training data 
and terminated within 40 generations. In comparison, classifiers trained with the 
SRS method converged at a slower rate, and were never capable of exceeding the 
accuracy achieved by the DRS method. This pattern is consistent with the results 
of all other experiments run in this investigation. 

Figure 3 shows the progression of training and test accuracy for both SRS 
and DRS. It can be noted that for both DRS and SRS, the curves of training 
accuracy and test accuracy are very close, which indicates that the classifiers 
generated in the training process generalise well to unseen data. An additional 
point of interest is that, even after many generations of near perfect classification, 
the DRS method did not tend to over-train on the training data. If 

(* -0.208533 (if (n= (- ATTR39 ATTR69) (+ ATTR114 0.299995)) (- ATTR222 
0.253174) ( + ( + (if (Between ( * ATTR189 ( + ATTR205 ATTR74)) ATTR164 
ATTR12) ATTR108 ( if (Between ( - (* 0.253174 ATTR129) 0.9875) 0.128228 
(/ ATTR189 0.940897)) ( + ATTR39 (if (Between -0.947054 ATTR164 ATTR12) 
ATTR160 ATTR124)) (* 0.5979 0.230855))) ( + ( + (if (Between (+ ATTR205 
ATTR74) ATTR164 ATTR12) ATTR108 ( if ( Between -0.947054 ( if ( Between 
-0.947054 ATTR164 ATTR12) ATTR160 ATTR124) ATTR12) ( + ATTR39 (if ( 
Between -0.947054 ATTR164 ( - (/ ATTR149 ( if (Between -0.947054 (if ( 
Between -0.947054 ATTR164 ATTR12) ATTR160 ATTR124) ATTR12) ATTR108 
0.135714)) (* ATTR164 ATTR129))) ATTR160 ATTR124))(* 0.5979 0.230855))) 
(if (Between (* ATTR189 (+ (if (Between (- (* 0.253174 ATTR129) 0.9875) 
0.128228 ( / ATTR189 ( - 0.987674 0.747389))) ( + ATTR39 ( if ( Between 
-0.947054 ATTR164 ATTR12) ATTR160 ATTR124)) ( * 0.5979 0.230855)) 
ATTR74)) ATTR164 ATTR12) ATTR205 ( + ( + ( if (Between (* (* ATTR189 (+ 
ATTR205 ATTR74)) ( + ATTR205 ATTR74 )) ATTR164 ATTR12 ) ATTR108 (if ( 
Between (- (* 0.253174 ATTR129) 0.9875) 0.128228 (/ ATTR189 (- 0.987674 
0.747389))) ( + ATTR39 (if ( Between -0.947054 ATTR164 ATTR12) ATTR160 
ATTR124)) ( * 0.5979 0.230855))) ( + ( + ATTR160 (if (Between -0.947054 
ATTR164 ATTR12) ATTR160 ATTR114)) ( * 0.253174 ATTR129))) ATTR205))) 
(* 0.253174 ATTR129))) ATTR205))) 



Ranges: Class 1: D112 : -250 ~ +146 

Class 2: Other Textures : +147 ~ +250 



Fig. 6. Another program generated with the DRS method 



this had occurred the test accuracy in Figure 3 would be seen to decline in later 
generations, which is not the case here. 

In this same experiment, using the DRS method to distinguish D112 from the 
other three textures, ten classification programs were created at the end of each 
run (one for each fold of test data). Figures 5 and 6 show two examples of such 
programs. These programs were selected from different runs and were the best 
individuals of their run. These two classifiers achieved 100% training accuracy 
and 100% test accuracy before 200 generations. Each program has a separate 
range of values corresponding to class labels, which are generated by the dynamic 
range selection process. In the case of these programs it appears that a single 
threshold value is sufficient to accurately classify the textures. However it is 
possible that some textures could require more than one threshold, e.g. -250 to -100 
with 100 to 150 being the range of one class, and -99 to 99 with 150 and above being 
the range of another class. For such multi-threshold problems it is certainly 
more difficult for the SRS method to achieve high classification accuracy and 
easy convergence. 

Although the two programs in Figures 5 and 6 have good performance, they 
are relatively simple in form. Program 1 uses only one IF and one >= function 
(which are actually redundant as they compare two constant values) and a 
combination of arithmetic functions. The program only utilises 17 attributes, which 
correspond to 17 of the 256 pixels of the input texel images. Program 2 looks 
much more complicated than program 1, but it requires only 20 pixels of 
the 16x16 raw image as the input for its 20 attributes. It can also be noted 
that the sets of input attributes used by the two programs differ substantially. All 
programs from runs that were assessed made use of a different set of attributes 
and adhered to a variety of associated ranges. This variability in the approach 
to classification can be seen as an advantage for the further generalisation of the 
method. Having a flexible and varied approach from run to run should allow the 
method to be applied to a wide variety of texture classification problems. 

The four textures used in this investigation have varied levels of difficulty for 
classification. The accuracy of classifying the two textures D73 and D77 is the 
lowest of the two-texture classification problems, at 84.4%. With the addition of 
the two other textures, the accuracy changed by only a relatively 
small amount. The lower accuracy in this problem may be partially due to the 
size of the sampling window (16x16), which is possibly not large enough to extract 
the information needed to distinguish between these two textures. In contrast, 
it appears that the texture D112 is relatively simple for genetic programming 
to classify, particularly with the DRS method. This would indicate that the GP 
method is able to classify textures, although the accuracy of this classification 
is dependent upon some further factors. 

In an attempt to verify that the GP systems were utilising meaningful 
textural information from the problem, and not simply memorising attributes of 
the training dataset, four additional experiments were conducted in which 
each of the four textures was classified against itself. For all four textures, the 
training accuracy obtained from both the DRS and SRS methods was around 50%. 
The inability of the method to distinguish one texture from that same texture 
indicates that the accuracy of the classifier is based on the texture information 
present in the texels, rather than any specifically remembered elements of the 
data. Additional support for this is found in the fact that test and training error 
rates remain similar, even when long periods of training are given. 

One interesting consideration of the four multi-class experiments where one 
texture was classified against all other textures is that the generated classifiers 
with a high degree of classification accuracy can potentially be considered as the 
texture description function for a particular texture. According to our results, 
such classifiers are relatively small in size and do not require all pixel information 
of a texture image. Moreover, such classifiers are able to work directly on pixel 
data, indicating that the conventional texture feature extraction process is not 
necessary in the genetic approach. 

5 Conclusions and Future Work 

The aim of this study was to explore the genetic programming paradigm towards 
classification, and in particular, the difficult problem of texture classification. 

In this work two approaches to representation, static range selection (SRS) 
and dynamic range selection (DRS), were investigated. The results from all the 
experiments indicate that DRS can generate a more accurate classifier than 
SRS. Moreover, DRS has quicker convergence, which means a classifier with 
higher performance can be generated more quickly by DRS than by SRS. 

This work also proposed genetic programming as a new method towards 
texture classification. The results indicate this method was able to classify tex- 
ture images directly based on raw pixel information. Within the limited training 
process, some texture could be classified to a near perfect degree of accuracy. 

In future work, the investigation will involve the inclusion of more 
textures, the use of larger populations and more training generations, the use of 
different sampling window sizes and resolutions, and the inclusion of some simple 
features such as local statistical data or simple feature extraction functions in 
classifiers. The investigation will also address the understanding of the generated 
programs, which could explain why the genetic programming approach has good 
performance. Analysis of these programs could also help in understanding the 
textural properties of images. This study could further improve the applicability 
of the genetic programming method to the texture classification problem. 

Abstract 

The use of model-based diagnosis techniques for software debugging has been 
an active research area for several years. This paper describes the extension of 
model-based debugging by the utilization of object-oriented design information 
for the identification of structural faults. The typical structural software fault is 
the incorrect assignment, a problem that is both frequent and hard to identify if no extra 
information about the fault is present. We analyze the different types of faults, 
use heuristics about pre- and postconditions to infer missing or additional state 
variable assignments, and use statechart diagrams as additional constraints over 
the permissible method execution sequences. 



1 Introduction 

Detecting, locating, and repairing faults in software is a difficult and time consuming 
task. Detecting an incorrect behavior of a given program is done by using testing tech- 
niques or formal verification methods, e.g., model checking [CGL94]. Whereas much 
effort has been made on test theory, test methodology, and algorithms for automatic 
test-case generation, somewhat less work has been published on locating and repair- 
ing software faults. Because debugging is not only performed in the implementation 
and test phases of a project, but also in maintenance, saving debugging time natu- 
rally results in saving time and money over the whole product life cycle. Especially in 
maintenance, where the original developers may no longer be involved, debugging is 
very costly. An automated debugger for locating and fixing faults can help in such a 
situation. 

Automatic debugging approaches introduced in the past include program slic- 
ing [Wei84], algorithmic debugging [Sha83], dependency-based techniques [KR97, 
Jac95], probability-based methods [BH95], and others. These traditional approaches 
are either specific to a programming language, use specialized algorithms, or 
require explicit user interaction to locate a bug. In order to overcome these 
drawbacks and to improve the results of the above-mentioned approaches, the use of 
model-based diagnosis (MBD) for debugging was suggested [CFD93]. Model-based 
diagnosis [Rei87, dKW87] provides a general theory for diagnosis that has successfully been 
applied to various engineering areas. 

We have previously described the use of a model of Java programs that is based on 
recording dependencies between variables [MSW00a]. We convert the Java program 
into a logical description that is afterwards used together with a model-based diagnosis 
engine [Rei87, dKW87] for computing diagnoses, i.e., finding possible bug locations 
given an observed incorrect execution outcome. An alternative model that was based on 
the actual values being computed for a given test case was described in [MSW00b]. 

In model-based diagnosis [Rei87], the structural or bridge fault has justly been 
considered a special problem that requires special methods to be solved effectively. 
The original bridge fault, as described by Davis [Dav84], occurs in circuits where a 
short circuit exists across multiple lines and thus forces them to an identical state, 
despite the fact that they are not functionally connected according to the original 
description of the artifact. Such a fault can therefore only be handled by the use of a 
separate, second model that allows the identification of special situations (such as 
physical adjacency in the case of the bridge fault), which can then be examined more 
closely. This approach was generalized by Boettcher [Böt95], where the notion of physical 
layout was converted into an abstract context that activated additional model fragments 
suitable for capturing specific types of what was appropriately called "hidden interactions". 

Applying model-based diagnosis to software emphasizes the necessity to come to 
terms with structural faults, due to the different nature of buggy software: by definition 
an unfinished artifact for which, unlike in hardware diagnosis, a complete description 
cannot even exist. As mentioned above, our recent work has developed diagnosis models 
applicable to imperative programs written in the Java language at different levels 
of abstraction: dependency-oriented [MSW99], i.e., the model used for diagnosis only 
expresses dependencies between variables, and parts of the program that are assumed 
to be correct propagate correct values through the system until they contradict 
observations of bugs, at which point diagnoses can be computed in terms of components that 
are assumed to be incorrect; and value-oriented [MSW00a], where the semantics of the 
language are modeled in detail so that effectively the complete execution of a program 
is simulated and examined dynamically. Both approaches, as the authors note, are of 
limited applicability with regard to structural faults, although this issue was examined 
in [SW99], where structural faults were explicitly addressed by considering name 
misspellings, variable switchings, or searching for repair expressions (i.e., synthesizing 
missing parts) to provide correct functionality. This paper takes a different approach in 
that it uses available sources of information for focusing on structural faults. 
Nonetheless, it should be noted that the focus is still on diagnosis, i.e., error location, 
as opposed to verification approaches, which deal with error detection by means of checking 
whether the program formally satisfies external special-purpose specifications. 

The rest of the paper is structured as follows: We first examine the different types 
of structural errors that can occur in software and show their influence on a 
dependency-based diagnosis model. We then examine the use of object-oriented design 
documentation in augmenting diagnosis models. First we consider contracts and 
give an example of how their incorporation in the diagnosis process can exclude 
diagnoses, thus improving the process. We then examine the use of high-level models such 
as UML statecharts. The paper closes with a discussion of related results. 



2 Structural errors and software diagnosis 

Structural faults are faults that do not occur because a component is functioning 
incorrectly, but because there is a missing or additional connection between two 
components, as in a bridge fault in electrical engineering. Such faults are more relevant 
than in traditional domains when diagnosis is applied to designs, and in particular to software designs 
(which Davis mentioned as one particular area in his article). The use of an incorrect 
argument in an expression (e.g., by using a different variable name, switching the or- 
dering of arguments), or the omission of part of a complex expression constitute typical 
examples of such faults. 

In conventional model-based diagnosis, the system description is an exact specifi- 
cation not only of the overall behavior of the system, but of its individual parts. For 
example, when diagnosing the hardware implementation of a 16-bit adder, the adder’s 
system description will describe the behavior of the logical gates from which the adder 
is composed. A fault is assumed to occur because one of the components does not act 
according to its specification. 

In diagnosing a program, the assumption that the specification will be a complete 
representation of the structure of the artifact is invalid. The internal structure and the 
way in which the behavior is described will differ widely between a specification and its 
implementation in a programming language - the implementation will usually contain 
many auxiliary variables, data structures, and signals which have no explicit 
counterpart at all in the functional specification. The only "part" of the specification that 
is directly usable is, generally, the set of test cases, which are produced manually (sometimes 
by software tools). One is therefore forced to base the model of the implementation 
on analysis of the code of the implementation itself. That implies, however, that it 
is the model that reflects the incorrectness of the design and whose output (the im- 
plementation trace) is confronted with observations that are correct (the specification 
trace), whereas in traditional diagnosis problems, the model is correct and it is the ob- 
servations, made from the behavior of the actual system, that reflect on the incorrect 
behavior. 

The usual way of dealing with structural faults is to assume the existence of a 
different, complementary model that allows reasoning about the likelihood of such faults 
(e.g., modelling of spatial neighbourhood in the case of bridge faults). Before we 
consider this issue, we first want to examine structural errors more closely. 



2.1 Classifying Structural Errors 

We consider the execution of a sequential program P to be represented by a sequence 
$p_1, \ldots, p_n$ of states of its variables. We write $p_i < p_j$ if $i < j$. Note that the set of 
variables in the program is not constant over time, since local variables are created 
on entering subroutines (methods in Java) and instance variables are created when 
objects are created. 

We first want to examine how structural errors make themselves felt in the code 
for detection purposes. In hardware diagnosis, methods that attempt to solve 
structural diagnosis problems are generally applied only to diagnosis candidates postulating 
multiple faults. 

For the purpose of this discussion, we do not make a distinction between different 
types of variables (local, global, static, instance or class variables, or reference 
parameters to methods). We basically examine independent variables (local or static variables) and, later, 
as part of our example, variables that are associated with a particular object (instance 
variables). We call an incorrect assignment to a variable x masked at a point p in the 
execution of program P if the variable had its value overwritten by a different 
assignment without being accessed, or if the scope containing the variable was left/destroyed 
without the variable being accessed again. We write that x is masked* in p if there is a 
sequence of variables $x_1 = x, \ldots, x_n$ s.t. $x_{i+1}$ depends on $x_i$, but $x_n$ is masked in p. 
We write that a variable x is observed to be incorrect in a particular state p of P if the 
value of x in p is incorrect. We write that a variable x is felt incorrect in a state p of P 
if x is not masked* in p. 
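
The notion of masking can be illustrated by a minimal sketch over a recorded execution trace. The event encoding (pairs of an event kind and a variable name, with the kinds "assign", "access", and "exit" for leaving the enclosing scope) and the function name is_masked are assumptions made purely for this illustration; the sketch covers only plain masking, not the transitive masked* relation.

```python
def is_masked(trace, var, start):
    """Return True if the assignment to `var` at trace[start] is masked.

    Hypothetical trace encoding: a list of (kind, name) pairs where kind is
    "assign", "access", or "exit" (the scope holding the variable is left).
    """
    assert trace[start] == ("assign", var)
    for kind, name in trace[start + 1:]:
        if name != var:
            continue
        if kind == "access":
            return False  # the value was read, so a fault could propagate
        if kind in ("assign", "exit"):
            return True   # overwritten or destroyed without being read
    return False          # still live at the end of the observed trace


# Usage: the first assignment to x is overwritten before any access.
trace = [("assign", "x"), ("assign", "y"), ("assign", "x"), ("access", "x")]
first_masked = is_masked(trace, "x", 0)
second_masked = is_masked(trace, "x", 2)
```

In this trace the first assignment to x is masked, while the second is not, since its value is accessed afterwards.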

We examine the different types of structural errors that may result from 
assignments. Note that other types of assignment than assignment statements 
exist, e.g., the use of reference parameters in a subroutine, just as it is possible to 
access a variable outside a statement (e.g., in a loop condition). However, these 
cases can in general be expressed in terms of an assignment statement simply 
by postulating the introduction of an auxiliary variable. E.g., a call 
findName(name, user2) that is supposed to alter the value of name by reference 
can be replaced by the sequence findName(aux, user2); name = aux, 
or a condition if i < j can be written as aux = j; if i < aux. The 
only situation where this is not possible is a variable access in a loop 
condition: z = 0; while (i < z) {⟨body⟩; z = ⟨some expression⟩} would 
require both assignments to z to be replaced by assignments to aux and therefore lose the 
property of having only one position changed. 

We do omit intricate situations that are possible in expression-oriented languages 
like C or Java, where assignments are expressions too and can therefore be used inside 
other expressions, since such usage is hard to understand for the programmers themselves 
and therefore frowned upon¹. The exception is for loops, but we omit these for 
space reasons, since for theoretical purposes they can be simulated by while loops, 
although a specific treatment will be more effective in practice. Finally, we do assume 
that the variables and constants involved in the errors are type compatible, since that 
type of error can routinely be caught by current compilers. 

2.2 Structural assignment faults 

For illustration purposes, we examine the simple case of an incorrect assignment state- 
ment. The first case is the classical case of a statement that has been inadvertently 
added. 

¹ One could argue that a debugger could then help a lot here, which is true, but we do not claim 
that being better than programmers at the semantic understanding of programs is possible with 
current technology. Some situations are just too hard. 

Assignment added (AA): x=z; is contained in the program, but should not be. 

In all other cases we consider a variation of an intended correct assignment 
statement s with textual representation x = y; in program state p. This is a stand-in for 
the whole class of statements of type x = ⟨exp⟩ where ⟨exp⟩ is a complex 
expression containing an access of y. Given that s = x = y; is the statement that should 
actually be contained in the code, we have basically five possible erroneous statements 
s' that could be considered to be produced instead of s by "atomic" errors. The first 
four assume that y is a variable. The fifth assumes that y is a named constant or a 
literal (we refer to both cases simply as "constant" from here on). 

Wrong target (WT): s' = z=y; where z is a different variable. 

Wrong source (WS): s' = x=z; where z is a different variable. 

Constant source (CS): s' = x=c; where c is a constant. 

Switch (S): s' = y=x;. 

Variable source (VS): (Remember that y is a constant in this case.) s' = x=z; where 
z is a variable. 

Assignment omitted (AO): s is not contained in the code. 

We will refer to these cases by their abbreviations: WT, WS, CS, S, VS, AA, AO. 
Regardless of the type of error present there must be some test case such that the value 
assigned to x will be incorrect (or we are not dealing with an error but have merely 
uncovered a case of redundant data storage in the supposedly correct program). 

The WT and WS cases remove one dependency between variables in the program 
state and add another. The AA and VS cases add one, while the CS and AO cases drop 
one and the S case reverses the direction of one. 
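
The effect of each fault type on the dependency structure can be made concrete with a small sketch. We assume, for illustration only, that an assignment is represented as a pair of a target variable and the set of variables read on its right-hand side, and that a dependency is an edge (source, target); the names deps, dependency_change, and CASES are hypothetical.

```python
def deps(stmt):
    """Dependency edges (source, target) of one assignment; none if absent."""
    if stmt is None:
        return set()
    target, sources = stmt
    return {(src, target) for src in sources}


def dependency_change(intended, faulty):
    """Edges removed and added when `faulty` appears instead of `intended`."""
    good, bad = deps(intended), deps(faulty)
    return {"removed": good - bad, "added": bad - good}


S_VAR = ("x", {"y"})    # intended x = y; with y a variable
S_CONST = ("x", set())  # intended x = c; with c a constant

# (intended statement, erroneous statement) per fault type
CASES = {
    "WT": (S_VAR, ("z", {"y"})),
    "WS": (S_VAR, ("x", {"z"})),
    "CS": (S_VAR, ("x", set())),
    "S":  (S_VAR, ("y", {"x"})),
    "VS": (S_CONST, ("x", {"z"})),
    "AO": (S_VAR, None),
    "AA": (None, ("x", {"z"})),
}

changes = {name: dependency_change(*pair) for name, pair in CASES.items()}
```

Under this encoding, the S case removes the edge from y to x and adds the edge from x to y, i.e., it reverses one dependency, while an omitted assignment only removes an edge.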

Note that if the WT, WS, or S case results in z being incorrect in a program state 
$t > s'(p)$, then the value of x in $s'(p)$ is also incorrect, and if x is not masked*, x will 
also be felt incorrect in $t$. 

The WS, CS, VS, and AO cases result in only x being incorrect in program state 
$s'(p)$. An observation in $s'(p)$ will therefore identify x as a potential single fault 
candidate. 

As can be seen, examining single test cases does not necessarily result in multiple 
fault candidates. The identification of structural faults with multiple fault candidates 
only applies in domains where the structural faults have the semantics of bridge faults 
in the original model, namely that both ends of the new connection are influenced. A 
positive effect of this is that a dependency- or value-based model will still propose 
s' as one of its set of single fault diagnosis candidates. A negative effect is that it 
means the standard recognition method from hardware diagnosis is not applicable. The 
counterpart of, say, a VS fault in hardware diagnosis would be that of being shorted to 
ground in a circuit domain, or a tank leaking into the outside world in the domain 
described in [Böt95]. But even a tank leaking to the outside world would still result 
in a multiple error if no explicit leak fault mode were present (because it would imply 
a fault in all valves and pumps downstream of the tank, or perhaps in the absence of 
physical impossibility constraints, the water being sucked out of the tank by the input 
pipe). 

In software, on the other hand, we cannot make that assumption, since auxiliary 
information is constantly created and destroyed (e.g., when a scope containing 
auxiliary local variables is lost). Not only do we need a separate model level to localize the 
faults, we also need a separate model level to better hypothesize about their existence in 
the first place. 

Figure 1: ATSCC Class Diagram 

3 Utilizing Design Documentation 

Thankfully, there is a potential source of additional information that we can use if only 
we can relate its semantics to the type of problem we are facing. The development 
of object-oriented software led to the development of methods for designing object- 
oriented software, and a central issue of these methods generally consists of the pro- 
duction of design descriptions for documentation purposes, both during development 
and for maintenance. 

Example: Consider a strongly simplified version of a real-world Air Traffic Service 
Control Center (ATSCC) which controls the traffic flow across the intersection point 
of a group of air routes, controlling and tracking aircraft until 
they leave the immediate vicinity of the ATSCC along one of the incident routes. For 
brevity we assume that only one aircraft travels through the airspace of the ATSCC at 
any one time. The UML class diagram in Figure 1 shows the entities involved, with their 
attributes and their associations which represent links between entities, e.g., references 
stored in instance variables. A plane that enters the airspace of the ATSCC registers 
its flight ID, is asked by the ATSCC for its intended route, responds by which route 
it wants to exit, is assigned a flight path (route and altitude, not necessarily the one 
requested if multiple acceptable routes exist), and is tracked until it has attained that 
flight path and exited the ATSCC airspace along that route. We now consider different 
types of applicable design specifications. 

3.1 Pre- and Postconditions 

A widespread textual specification technique is the use of Contracts [Mey92], also 
known in terms of their constituents as pre- and postconditions (and invariants). 
Contracts describe, for individual methods in an object-oriented program, the conditions 
which must be satisfied to execute a particular method without causing an error, as 
well as the conditions which are satisfied by the state of the program after execution of 
the method. Invariants describe conditions which hold throughout the lifecycle 
of an object, e.g., the requirement that a stack cannot hold a negative number of 
elements. The degree of detail in writing contracts is generally left to the user, since the 
conditions are, with exceptions (such as the programming language Eiffel), not 
considered as runtime checks but as documentation aids. 

Below we have a list of specifications for methods of the ATSCC class, written in 
the UML Object Constraint Language (OCL), expressing pre- and post-conditions in 
terms of so-called state variables. Subsets of the method parameters are listed. The 
state variables are here, as is often the case, directly derived from the class diagram, 
and intuitively correspond to instance variables of the object in question. These in- 
stance variables can either contain values or references (i.e., state changes can express 
attribute or link changes in the terminology of the class diagram). It would also be 
conceivable that the "state variables" would express some aggregate property of the 
object; in that case they would roughly correspond to a user defined predicate. 

Plane: Plane. 
RequestedRoute: integer. 
PathOffered: FltPath. 
RequestType: {unknown, land, continue}. 

Op register 
    pre: Plane = nil 
    post: Plane ≠ nil and RequestType ≠ unknown 
Op processRouteRequest(R: 0..N) 
    pre: RouteSelected = nil 
    post: RouteSelected = R and PathOffered.plane = Plane 
Op trackCourse 
    pre: PathOffered ≠ nil 
    post: PathOffered = nil and Plane.course = PathOffered.dir 
Op scanArrival 
    pre: PathOffered = nil and Plane = nil 
Op askRoute 
    pre: RouteSelected = nil 
Op dropControl 
    pre: RequestType ≠ unknown 
    post: PathOffered = nil and RequestType = unknown 
          and Plane = nil. 

A specification for the assignPath operation of the plane could be as follows: 

Op assignPath(P: FltPath) 
    pre: controller.RouteSelected ≠ nil and newPath = nil 
    post: newPath = P 

Consider the following code example. It picks a path for a plane among those 
available for a route, that (if the plane has no approach warning transponder) guarantees 
sufficient separation to earlier planes sent along it. 

void processRouteRequest(routeID rid) 
{ fltPath path; 
  plane p2; 
  integer i; 

1.  route = self.findRoute(rid); 
2.  while ((p.path == nil) && (i < noOfPaths)) { 
3.    path = route.getPath(i); 
4.    p2 = path.plane; 
5.    if (p.noTP() && (time - p2.timeOfEntry >= 10)) { 
6.      path.plane = plane; 
    } } 
7.  pathOffered = path } 

Structural errors that could happen in this method (focusing only on the planes) 
would include replacing statement 4 by plane = path.plane; (a WT error), 
path.plane = p2; (S), or p2 = plane; (WS), or replacing statement 6 by 
path.plane = p2; (WS), plane = path.plane; (S), or p2 = plane; 
(WT), and of course omitting either statement (AO errors). 

Let us consider the errors for statement 6. The first (WS) will lead to an incorrect 
time check when the next plane selects that route. The second (S) means that the 
ATSCC is tracking the wrong plane; this might lead to an error when it sends a course 
change command to that plane. The AO case has the same effect as the first in this 
example. The WT case means also that statement 7 is executed on the wrong plane and 
has no effect. 

Now assume that either the AO or WT case has happened, and the error is noticed 
in testing (e.g., because the next plane is assigned an incorrect height or the path’s 
last-plane has the wrong flight number). In the WT case tracing the dependencies 
will not include the replaced statement 6 as a potential diagnosis candidate, and of 
course the same holds in the AO case (where statement 6 is not present). 

It should be noted that CASE tools which automatically complete the code to update 
the inverse pointer of a link, such as the "on" link between plane and path, do not prevent 
such an error from occurring, since the incorrect assignment can of course result in 
an incorrect completion. However, the code would be organized differently (the path 
assignment would not be in a separate method). 

3.1.1 Heuristic Modeling 

How can we use the contract information to restrict the diagnosis candidates? It should 
be recognized beforehand that any such use is necessarily heuristic since there is no 
guarantee that the design information is more correct than the program, except that it 
usually predates the program and therefore can be assumed to be more mature, and that 
it will be more abstract and therefore easier to check for inconsistency (although not 
necessarily for completeness). There is an obvious heuristic which can be applied to a 
correct and nonredundant contract specification. 

- If a state variable is mentioned in a postcondition but not in the precondition, then it will 
be tested or changed in that method. 

This is because otherwise the method could not guarantee its own postcondition; for 
that the variable would either have to be correct before execution, be changed during 
execution, or at least be tested during execution to see that no change is necessary. 

- If a state variable is mentioned in a precondition but not in the postcondition, then it will 
be accessed (because a particular value is expected and needs to be protected) during 
execution. 

We accommodate these heuristics as follows. (Any reference to "variable" in the 
following description refers to a state variable visible outside the method.) If a variable 
$v$ is in the postcondition of method $m$ but not assigned, add a dependency $v \Leftarrow m_{body}$. 
If $m$ calls other methods with the same condition on $v$, then only the innermost method 
call is preferred. If $v$ is mentioned in the precondition of $m$ but not accessed, add $v$ to 
all dependencies in the body of $m$ and prefer those dependencies in diagnosis. In the 
other cases, do nothing. 

Example: Consider the two method calls below. These would be part of the ATSCC 
code, first identifying a path for the plane to take based on its chosen route and then 
assigning the path to the plane (which includes informing the plane about needed course 
changes). 

processRouteRequest(rid); 
p.assignPath(path); 

The postcondition for processRouteRequest included RouteSelected, which 
was indeed changed in the method. The precondition for assignPath includes 
newPath, which is indeed changed in the method (assigned a value which it did not 
previously have). 

Assume we have a version of processRouteRequest where statement 6 has 
been replaced by the WT error p2 = p;. If we use a dependency-based system 
description as mentioned in Section 1 for diagnosing the ATSCC code, we 
obtain the variable dependencies $p_1 \Leftarrow \{p_0, rid_0, path(i)_0\}$ and 
$pathOffered_1 \Leftarrow \{rid_0, path(i)_0\}$ for the call of processRouteRequest (using p for "plane" 
and $s_1$ to refer to the first line). The dependency 
$pathOffered_1.plane \Leftarrow \{rid_0, path(i)_0, p_0\}$ is missing and a dependency-based diagnosis will not return $s_1$ 
as a source for the error. If we add a special dependency 
$pathOffered_1.plane \Leftarrow processRouteRequest_{body}$ because pathOffered is listed in the postcondition, 
then the method will be included in the search, and diagnoses containing such a special 
method reference can be preferred to other diagnoses. 
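
The way the two heuristics augment a dependency-based description can be sketched as follows. This is a minimal illustration under stated assumptions, not the diagnosis engine itself: dependencies are assumed to be a mapping from a target variable to the set of its sources, the sets of assigned and accessed state variables are assumed to come from a prior static analysis, and the function name augment_dependencies and the "_body" suffix are hypothetical. The preference for the innermost method call is not modeled.

```python
def augment_dependencies(method, pre_vars, post_vars, assigned, accessed, body_deps):
    """Add heuristic dependencies derived from a method contract.

    Returns the augmented dependencies and the set of targets whose
    dependencies should be preferred during diagnosis.
    """
    deps = {target: set(sources) for target, sources in body_deps.items()}
    preferred = set()
    body_ref = method + "_body"  # stands for the whole method body
    # A postcondition variable that is never assigned depends on the body.
    for v in sorted(post_vars):
        if v not in assigned:
            deps.setdefault(v, set()).add(body_ref)
            preferred.add(v)
    # A precondition variable that is never accessed is added to every
    # dependency found in the body.
    for v in sorted(pre_vars):
        if v not in accessed:
            for target in body_deps:
                deps[target].add(v)
                preferred.add(target)
    return deps, preferred


# Usage with illustrative sets for the faulty processRouteRequest.
deps_out, preferred_out = augment_dependencies(
    "processRouteRequest",
    pre_vars={"RouteSelected"},
    post_vars={"RouteSelected", "pathOffered.plane"},
    assigned={"RouteSelected", "pathOffered", "p"},
    accessed={"RouteSelected", "p", "rid"},
    body_deps={"p": {"p", "rid", "path(i)"}, "pathOffered": {"rid", "path(i)"}},
)
```

With these inputs the only addition is a dependency of pathOffered.plane on processRouteRequest_body, which is also the only preferred entry, so the method body becomes reachable from the faulty observation.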

3.1.2 Value-oriented Modeling 

Harder constraints than these two heuristics can be obtained when looking at the con- 
ditions themselves. The precondition to assignPath mentions that newPath should be 
nil; this condition will obviously be violated if statement 6 is replaced by p2 = p ; . 
In general it will not be possible to analyze such cases statically. 

When using a value-based model, the nature of pre- and postcondition changes. 
They are in effect used as runtime constraints. Once diagnosis is started, they are 
tested whenever values are propagated forward out of a procedure call or backward 
before the procedure call, leading to contradictions when the constraint is not satisfied. 
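
Treating a contract as a runtime constraint can be sketched as a pair of predicates that are evaluated on the program state before and after a call. The representation below (a dictionary of lambdas over a state dictionary, the key names routeSelected and newPath, and the function violations) is an assumption made for illustration; only the conditions themselves are taken from the assignPath specification given earlier.

```python
# Hypothetical encoding of the assignPath contract as runtime predicates.
CONTRACTS = {
    "assignPath": {
        "pre": lambda s, args: s["routeSelected"] is not None and s["newPath"] is None,
        "post": lambda s, args: s["newPath"] == args["P"],
    },
}


def violations(op, before, after, args):
    """Return which parts of the contract of `op` are contradicted."""
    contract = CONTRACTS[op]
    found = []
    if not contract["pre"](before, args):
        found.append("pre")
    if not contract["post"](after, args):
        found.append("post")
    return found


# Usage: a consistent call, and one where newPath was already set.
ok = violations(
    "assignPath",
    before={"routeSelected": 3, "newPath": None},
    after={"routeSelected": 3, "newPath": "A"},
    args={"P": "A"},
)
bad = violations(
    "assignPath",
    before={"routeSelected": 3, "newPath": "B"},
    after={"routeSelected": 3, "newPath": "A"},
    args={"P": "A"},
)
```

A non-empty result plays the role of a contradiction in the value-based model: it indicates that the values propagated to or from the call are inconsistent with the design information.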

Figure 2: Statechart of class ATSCC 



3.2 UML statecharts 

The class diagram depicted above uses the notation of what is currently the most widely 
used and examined of the OO design notations, the Unified Modeling Language 
(UML) [RJB99], which resulted from the merger of several older methodologies. It 
provides a variety of diagram types. Here we are interested in three of the most 
important diagram types: class diagrams, statecharts which describe the dynamic behavior 
of the individual object classes, and collaboration diagrams which show message 
interaction between a group of objects in a particular application situation. 

The statechart diagram in Figure 2 shows the behavior of the ATSCC entity. Tran- 
sitions in the statechart can be labeled in the form trigger event[guard]action(). Events 
cause transitions, guards have to be satisfied to permit an event to fire a transition, and 
actions or activities depict the operations executed on an object as a result of under- 
going the transition or while in a state. By default, events are assumed to correspond 
to outside method calls and the actions or activities show the code that is being ex- 
ecuted, and that is the assumption we make here. Unlabeled completion transitions 
depict a state change caused by completion of an internal activity (e.g., completion of 
trackCourse leading to state Waiting). The set of statecharts for all objects in a system 
represents a high level model of the causal structure of the system. 

The designers arrive at statecharts by a process of compilation (manual) and 
generalization from the set of collaboration diagrams in which a class is involved. The 
existence of the collaboration diagram guarantees that we know which class is the 
receiver of a particular method corresponding to an action in the statechart diagram. We 
do not show a graphical example of a collaboration diagram but instead simply define 
it formally. Given a set of classes $C$, a collaboration diagram is a sequence of message 
sends $m_1, \ldots, m_n$ where each $m_i$ is a tuple $(s_i, r_i, (p_{i1}, \ldots, p_{ij}))$, with $s_i$ the sender, $r_i$ the 
receiver, and $(p_{i1}, \ldots, p_{ij})$ the list of parameters of the message send. 

The semantics of UML statecharts are not formally defined, although for our pur- 
poses we can approximate them by considering them as finite automata annotated with 
methods: transition, entry, exit, and completion actions. 

The basic concept of using statecharts in addition to contracts is the realization that 
a statechart effectively constitutes a constraint on the sequence of method calls in a 
program. The states do generally correspond to particular attribute/link constellations, 
but this information cannot be gleaned from the diagram without further annotations. 
In the simplest manner we can therefore use statecharts as additional constraints on the 
method call sequences in the program. 
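
Using a statechart as a constraint on method call sequences can be sketched with a finite automaton. Since the statechart figure is not reproduced here, the transition table below is hypothetical: only the state name Waiting and the operation names are taken from the text, while the other state names, the transitions, and the function first_illegal_call are assumptions made for illustration.

```python
# Hypothetical transition table: (state, method call) -> next state.
TRANSITIONS = {
    ("Waiting", "register"): "Registered",
    ("Registered", "askRoute"): "RouteRequested",
    ("RouteRequested", "processRouteRequest"): "Tracking",
    ("Tracking", "dropControl"): "Waiting",
}


def first_illegal_call(calls, start="Waiting"):
    """Index of the first call the automaton does not permit, or None."""
    state = start
    for index, call in enumerate(calls):
        next_state = TRANSITIONS.get((state, call))
        if next_state is None:
            return index  # the observed sequence violates the statechart
        state = next_state
    return None


# Usage: one permitted sequence and one that skips askRoute.
legal = first_illegal_call(["register", "askRoute", "processRouteRequest", "dropControl"])
illegal = first_illegal_call(["register", "processRouteRequest"])
```

A returned index marks the point in an observed trace where the program departs from the permitted call sequence, which can then be used to focus the diagnosis.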

3.3 Using Statecharts alone 

Importantly, both contracts and statechart diagrams are abstractions of code behavior. 
However, they abstract different issues. Statecharts are more likely to abstract state 
information but retain causality. Contracts are more likely to ignore causal information 
and focus on specifications of individual operations. 

While the conditions expressed in contracts at design level may well be real 
abstractions of the actual object state (i.e., the variables referred to in pre- and postconditions 
may not be actually represented in the code), this is much more likely in the case of 
statechart diagrams. The gain in causal information is therefore offset by the need to 
construct a mapping between the abstract states and the conditions in terms of state 
variables they represent; in other words, contracts. 



3.4 Static Tests 

An obvious question is why, in those cases where we are using a dependency-based 
system description (which is statically computed), contracts and state diagrams are not 
just used statically. In principle a comparison of dependencies should give us a set of 
possible fault locations. That is correct and can in fact be done. However, neither con- 
tracts nor statechart diagrams are semantically complete and therefore actual observed 
testcases provide additional information. Static dependency analysis always implies 
overestimation. In the traditional debugging community this led to the development 
of methods like dynamic program slicing [KR97] to overcome the restrictions of static 
slicing [Wei84]. 

In other words, static dependency analysis lists all methods where the contract spec- 
ifications do not fit our heuristics. A diagnosis run using a testcase with incorrect output 
will list those parts of the code which could possibly contribute to that testcase. 

When examining contracts at the level of state variable values, then of course static 
analysis is not possible. 

4 Discussion 

In [Böt95], the diagnosis of structural faults was treated as the discovery of hidden 
interactions. The HiDe&Seek approach hypothesized about structural faults only when 
multiple faults occurred, a particular contextual constellation (e.g., physical layout) of 
components was given, and the behavioural modes of the components were matched. 
This layering was used for effective implementation that introduced hidden interactions 
as particular working hypotheses that activated a model of the particular fault. 

In our case, the implementation side is somewhat simpler. The interactions are 
not really hidden since all assignments are visible. What is not clear is which of the
assignments are correct. In fact, the assumption that a structural fault must be indicated
by multiple fault diagnoses does not hold, and the prior presence of separate
models must be exploited to specifically mark those cases as relevant. Apart from
the introduced priority ordering among diagnoses, the implementation therefore simply
takes place in terms of extending the existing dependency-based [MSW99] or value-based
[MSW00a] diagnosis models. In both cases the overhead is limited. When
working with the value-based model, the design information plays the role of additional 
constraints during the propagation of the model. When working with the dependency- 
based model, the number of rules added is linear in terms of the size of the contract 
conditions (i.e., the number of variables involved), whereas the number of rules in the 
model is not linear in terms of the size of the methods due to the way in which loops 
are handled. 

In comparison to repair-based approaches [SW99] the flexibility of the approach 
presented here is drastically constrained. It is also more rarely applicable than the 
models of [MSW99, MSW00a], since it requires the existence of separate design information.
On the other hand that is its advantage: it does not require additional effort
in cases where these widespread techniques are in use and the search space is limited.
Information from earlier design stages was also used in [FSW99], which used model-based
techniques to debug VHDL circuit designs. However, the information there was
only available and presented to the system in terms of execution traces (i.e., test cases),
not as a high-level description. Also, most importantly, the abstraction levels in the
VHDL case were different. The “high-level” functional specification was fully executable
and programmed in terms of conventional code using loops (i.e., providing a
degree of detail comparable to what we are working with at the implementation level;
most software design documentation is much less detailed), whereas the implementation
level (RT level) replaces the loops by concurrently executing smaller components.
Finally, the execution trace information was not used for (and is not suitable for) identifying
structural faults.

Other work includes [PW90], where the classical bridge faults were considered in 
terms of a framework that simply included all possible interaction paths, and of course 
the classical work by Davis [Dav84] addressed the modeling issue in terms of “adjacency”
in a second model that showed different interactions. His concept of adjacency
would correspond to the fact that the design notations we utilize attempt to address the 
basic causal structure of the application, e.g., the requirements for methods to interface 
or the sequence of states in a diagram. Of course this means that interactions outside 
the realm of the diagrams (e.g., in auxiliary data structures) are not covered. 

4.1 Future Work 

At the moment the work presented here is purely conceptual and not implemented. 
Due to the richness of OO design processes the number of possible avenues for further 
research is huge. Issues include the effects of the fact that specifications can be formulated
at differing levels of abstraction depending on application or process model.
UML-based CASE tools automatically create code stubs from certain modeling constructs
that may help or hinder the application. We have not considered the possibility
of analyzing explicitly represented iteration in diagrams (cyclic state sequences).

5 Conclusion

This paper has described an approach that uses available design information produced 
by contemporary OO design processes to deal with the concept of the structural fault in
software. We have examined different types of structural faults involving assignment 
statements as the basic type to which others can be traced. We have examined the 
use of pre- and postconditions for method calls as the basis for heuristics that allow 
focusing on structural faults that result in omission of variables mentioned in the pre- 
and postconditions, and have extended this concept to UML statecharts, using both 
dependency- and value-based diagnosis models.

Abstract. Modern animation packages provide partial automation of action
between key frames. However, the creation of scenes involving many
interacting characters still requires most of the work to be done by hand by
animators, and any automatic behavior in the animation sequence tends to be
hard-wired and lacking autonomy. This paper describes our “FreeWill”
prototype which addresses these limitations by proposing and implementing an 
extendable cognitive architecture designed to accommodate goals, actions and 
knowledge, thus endowing animated characters with some degree of 
autonomous intelligent behavior. 

Keywords: cognitive modeling, lifelike characters, multiagent systems, 
planning 



1 Introduction 

Modern animation packages for film and game production enable the automatic 
generation of sequences between key frames previously created by an animator. 
Applications for this exist in computer games, animated feature films, simulations, 
and digitized special effects, for example synthesized crowd scenes or background 
action. However, automated animation has, until very recently, been limited to the 
extent that characters move without autonomy, goals, or awareness of their 
environment. For example in moving from A to B a character might come into 
unintended contact with obstacles, but instead of taking avoiding action or suffering a 
realistic collision the animation package generates a scene in which the character 
simply passes through the obstacle (Figure 1a). Such incidents must be repaired
manually by the human animator (Figure 1b). Although recent versions of
commercially available animation packages have incorporated limited environment
awareness and a degree of collision avoidance, there remains considerable scope for
applying AI to animated characters to endow them with a full animation-oriented
cognitive model, as advocated by Funge et al. (Funge, 1998; Funge et al., 1999). The
role of the cognitive model is to provide perception, goals, decision making, and
autonomous interaction with their surroundings and other characters. This paper
describes our “FreeWill” prototype (Forte et al., 2000; Amiguet-Vercher et al., 2001)
which has recently been initiated with the eventual aim of adding such capability to 
commercially available animation packages. 




Fig. 1b. The scene corrected manually

2 Components of the System

An animated sequence consists of characters (avatars) interacting within a graphically 
defined setting. In our system, avatars are implemented as agents, i.e. something: 
“that can be viewed as perceiving its environment through sensors and acting upon 
that environment through effectors” (Russell and Norvig, 1995). In the example
illustrating this paper, the setting is a city street populated by avatars walking in either 
direction. Their behavior consists of walking towards a set destination, avoiding 
collisions, and stopping to shake hands with designated “friends”. Subject to fulfilling 
these goals, an avatar’s behavior is otherwise autonomous. The action is simulated in 
our software, and the information then translated to a file format that is 
understandable by the animation package. Several standard formats are available in 
the industry. In our present prototype we use 3D Studio Max as an animation package 
and we interface with it through step files and scripts written in MaxScript (e.g. as in 
Figure 2). We could also interface with other packages available in the market such as 
Maya. The animation package then renders each frame and produces a video of the 
simulated interaction of the avatars. A scene from one such video is shown in Figure 
3. 



biped.AddNewKey LarmContS 0
biped.AddNewKey RarmContS 0
sliderTime = 10
rotate RForearmS 30 [-1,0,0]

biped.AddNewKey LarmContS 10
biped.AddNewKey RarmContS 10
sliderTime = 20
rotate RForearmS 80 [0,0,-1]
biped.AddNewKey LarmContS 20
biped.AddNewKey RarmContS 20

Fig. 2. Sample script for generating avatar behavior

Fig. 3. Avatar interaction



The class structure underpinning our system is depicted in Figure 4, which
presents a UML (Unified Modeling Language) model currently implemented in Java.
As shown in Figure 4, the principal classes of the system are:

• World, comprising all physical objects, including avatars, participating in the
scene. Details stored for each object include a complete description of shape,
dimensions, colour, texture, current position etc., sufficient to render the object.

• Avatar, which consists of a physical body together with an AI engine, instantiated
as a separate AI object (on-board “brain”) for each avatar. The body provides the
AI engine with all necessary sensing and actuator services, while the AI engine
itself is responsible for perception (interpretation of information) and the issue of
appropriate motion commands based on goal planning. As a subsystem, the AI
engine is built of an action planner, a motion controller, and a knowledge base
storing goals and facts, and the avatar’s private world model (which represents
the fragment of the virtual world currently seen and remembered by the avatar).

• A scheduler based on discrete event simulation and a queue handler enabling the 
autonomous behavior to unfold within the virtual world by passing control to 
appropriate world objects (including avatars) according to the event which is 
currently being processed. 

• There is also one external component used to generate the final animation - the 
animation package or more generally visualization engine - this part of the 
system is responsible for displaying the world model and the interacting avatars. 
At the moment this is performed by the package 3D Studio Max as described 
above. The system can also interface with other products and other formats, e.g. those
using motion capture files. The visualization engine must also allow for rendering
the scenes and for saving the final animation.

Fig. 4. UML model of the system



3 Logic Controlling an Avatar’s Behavior 

One of the key elements of the knowledge base is the internal world model. Every 
time an avatar performs an action, the process is initiated by first updating the avatar’s 
world model. The avatar senses the world via a vision cone, through which it gains 
awareness of immediate objects in its path (see Figure 5). The information obtained 
from the vision cone is then used to modify the avatar’s plan and perform the next 
action.

Fig. 5. Scene as seen by an avatar



An avatar’s behavior is goal directed. The primary goal is provided by the user 
and represents the aim of the simulation for that avatar. In the example illustrated in 
Figure 3, the primary goal is to ‘get to the end of the sidewalk’. However the 
fulfilment of this goal may be enacted with accomplishment of secondary goals which 
are set and assessed by the avatar. Examples are ‘avoid collisions’ and ‘shake hands 
with friends’. Such goals are a part of the avatar’s knowledge. When to give such 
goals priority can be inferred from the current world state. The rules of an avatar’s 
behavior are stored in the knowledge base as sets of facts and rules. The knowledge 
base also provides logical information about static world objects and other avatars 
(e.g. a list of friends). The logic controlling the avatar’s behavior is as follows: 



DoSensing()
{
    image = Body.Sense()
    {
        return VisionCone.GetImage()
    }
    Mind.UpdateWorldModel(image)
    {
        KnowledgeBase.ModifyWorld(image)
        {
            WorldModel.ModifyWorld(image)
        }
    }
    Mind.RevisePlan()
    {
        ActionPlanner.Plan()
        {
            KnowledgeBase.GetGoals()
            ExploreSolutions()
            KnowledgeBase.GetObjectInfo()
            {
                WorldModel.GetObjectAttribs()
            }
            CreatePlan()
            lastAction = SelectLastPlannedAction()
            MotionControl.Decompose(lastAction)
        }
    }
    action = Mind.PickAction()
    {
        microA = ActionPlanner.GetMicroAction()
        {
            return MotionControl.GetCurrentAction()
        }
        return microA
    }
    return ConvertActionToEvent(action)
}

Fig. 6. Logic controlling an avatar’s behavior 

The main simulation loop is located within the Scheduler class which 
consecutively picks events from an event queue. Control is then passed to the 
appropriate world object to which the event refers (which in most cases is an avatar) 
and necessary actions are taken. These can be:

• an ‘act’ action - such as moving a hand or making a step. The action is rolled out (the
avatar’s state variables are updated) and a new line is added to the MaxScript file.
This action returns a new sensing event to be inserted in the event queue.

• a ‘sense’ action - which means that the avatar should compare the perceived
fragment of the world with its own internal model. Then the avatar has a chance
to rethink its plan and possibly update goals and the planned set of future actions.
This action returns a new acting event.

The returned actions are inserted in the event queue and the time is advanced so that 
the next event can be selected. A PeriodicEventGenerator class has been introduced to 
generate cyclic sensing events for each avatar so that even a temporarily passive 
avatar has its internal world model updated. 
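
The event loop described above can be illustrated with a minimal Python sketch. It is not the authors' Java implementation: the `Event` record, the one-dimensional `Avatar` stand-in, the fixed time step of 1, and the `script` list (standing in for lines written to the MaxScript file) are all assumptions made for illustration. Only the alternation of ‘act’ and ‘sense’ events through a time-ordered queue follows the text.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class Event:
    time: float
    seq: int                                  # tie-breaker for equal times
    kind: str = field(compare=False)          # 'act' or 'sense'
    avatar: "Avatar" = field(compare=False)


class Avatar:
    """Hypothetical stand-in for an avatar walking along one axis."""

    def __init__(self, name, goal_x):
        self.name, self.x, self.goal_x = name, 0, goal_x
        self.script = []                      # stands in for MaxScript output

    def act(self):
        # Roll out one microaction: update state and log a script line.
        if self.x < self.goal_x:
            self.x += 1
            self.script.append(f"{self.name} step -> {self.x}")
        return "sense"                        # an act yields a sensing event

    def sense(self):
        # Plan revision is omitted; keep acting until the goal is reached.
        return "act" if self.x < self.goal_x else None


def run(avatars, max_time=100):
    seq = count()
    queue = [Event(0, next(seq), "sense", a) for a in avatars]
    heapq.heapify(queue)
    while queue:
        ev = heapq.heappop(queue)             # earliest event first
        if ev.time > max_time:
            break
        nxt = ev.avatar.act() if ev.kind == "act" else ev.avatar.sense()
        if nxt is not None:                   # returned event goes back in the queue
            heapq.heappush(queue, Event(ev.time + 1, next(seq), nxt, ev.avatar))
```

A periodic sensing generator, as mentioned in the text, could be modelled in this sketch by pushing extra ‘sense’ events at fixed intervals for every avatar.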

The goal-planning algorithm constructs plans using the notion of an action as a 
generic planning unit. An action can be defined on various levels of specialization - 
from very general ones (e.g. ‘get to the end of the sidewalk’) to fairly detailed 
activities (‘do the handshake’). The most detailed actions (microactions) are said to be 
at level 0. They correspond to action events in the event queue and also to MaxScript 
file entries. In general every action is specified by a pre- and postcondition and is
implemented by an avatar’s member function, which will perform the action and
update the state of objects affected by it. These objects can be world objects or parts
of the avatar’s body. The planning unit (ActionPlanner) operates on actions from level
N to 1 - creating general plans and then refining them. The ActionPlanner maintains 
the chosen plan from which the last action is submitted to the MotionControl unit. It 
is then decomposed into a set of level 0 microactions (e.g. handshake consists of a set 
of arm and hand movements) which can be executed one by one. Any change in the 
plan may cause the list of microactions to be dropped and new ones to be generated.

If an action event is pulled from the queue then the scheduler updates the
appropriate property of the world object that owns the event. At the same time the 
scheduler passes the information of that movement to the interface with the animation 
package so as to update the state of the world that will be displayed in the animation. 
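
The refinement of a planned action into level 0 microactions can be sketched as a recursive expansion. The following Python fragment is only an illustration under assumed data: the `DECOMPOSITION` table and the action names in it are hypothetical, and the `MotionControl` class here merely mimics the role described in the text (it is not the class from the authors' system).

```python
# Hypothetical action library: a non-primitive action maps to its sub-actions.
# Any action without an entry is treated as a level 0 microaction.
DECOMPOSITION = {
    "greet friend": ["walk to friend", "do the handshake"],
    "do the handshake": ["raise right arm", "grasp hand", "lower right arm"],
}


def decompose(action):
    """Expand an action recursively into an ordered list of microactions."""
    subs = DECOMPOSITION.get(action)
    if subs is None:
        return [action]
    micro = []
    for sub in subs:
        micro.extend(decompose(sub))
    return micro


class MotionControl:
    """Holds the microactions of the most recently submitted action."""

    def __init__(self):
        self.pending = []

    def load(self, action):
        # A changed plan drops the old microactions and generates new ones.
        self.pending = decompose(action)

    def next_microaction(self):
        return self.pending.pop(0) if self.pending else None
```

Calling `load` again before the list is exhausted corresponds to the case where a change in the plan causes the remaining microactions to be dropped.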



4 Conclusion and Future Direction 



This paper has explained our framework for supporting autonomous behavior for 
animated characters, and the mechanisms that drive the characters in the simulation. 
The resulting actions are rendered in an animation package as illustrated. Our current 
prototype indicates that there is considerable scope for the application of AI to the 
automatic generation of animated sequences. In the current system the 
implementation of goal-based planning is inspired by STRIPS (Fikes and Nilsson,
1971; Fikes, Hart and Nilsson, 1972). As a next step it would be interesting to extend
our framework to experiment with planning activity that is distributed across several 
agents and takes place in a dynamic complex environment requiring the intertwining 
of planning and execution. Such requirements imply that goals may need to be 
changed over time, using ideas described for example by Long et al (Long, 2000). 
The prototype we have developed is a useful environment for developing and testing 
such cognitive architectures in the context of a practical application.

Abstract. Ignoring the noise of physical sensors and effectors has always
been a crucial barrier towards the application of high-level, cognitive
robotics to real robots. We present a method of solving planning
problems with noisy actions. The approach builds on the Fluent Calculus 
as a standard first-order solution to the Frame Problem. To model noise, 
a formal notion of uncertainty is incorporated into the axiomatization of 
state update and knowledge update. The formalism provides the theoret- 
ical underpinnings of an extension of the action programming language 
Flux. Using constraints on real-valued intervals to encode noise, our
system makes it possible to solve planning problems for noisy sensors and effectors.



1 Introduction 

Research into Cognitive Robotics aims at explaining and modeling intelligent 
acting in a dynamic world. Whenever intelligent behavior is understood as re- 
sulting from correct reasoning on correct representations, the classical Frame 
Problem [12] is a fundamental theoretical challenge: Given a representation of 
the effects of the available actions, how can one formally capture a crucial regu- 
larity of the real world, namely, that an action usually does not have arbitrary 
other effects? Explicitly specifying for each single potential effect that it is ac- 
tually not an effect of a particular action, is obviously unsatisfactory both as a 
representation technique and as regards efficient inferencing [3]. The predicate 
calculus formalism of the Fluent Calculus [15], which roots in the logic program- 
ming approach of [7], provides a basic solution to both the representational and 
the inferential aspect of the Frame Problem. This solution also forms the theoretical
underpinnings of the action programming language Flux (the Fluent
Calculus Executor) [16], which is based on constraint logic programming and
allows one to specify and reason about actions with incomplete states, and thus to
solve planning problems under incomplete information.

In order to make it possible for a robot to reason about the correct use of 
its sensors, the Fluent Calculus has been extended by an axiomatization of how 
sensing affects the robot’s knowledge of the environment [17]. A corresponding 
extension of Flux has been developed in [18], which allows to solve planning 
problems with sensing actions. However, the method shares a common assumption
of high-level approaches to sensing, namely, that sensors are ideal. Ignoring
the noise of physical sensors and effectors has always been a crucial
barrier towards the application of cognitive robotics to real robots, because noisy
sensors and effectors may affect the correctness of plans.

In this paper, we extend the existing model for acting and sensing in the 
Fluent Calculus to the representation of noise in both sensors and effectors. 
Actions with noise result only in limited certainty about the values of the affected 
state variables. As part of the approach we define a notion of executable plans 
in which the robot may condition its further actions on previously obtained 
sensor readings. In the second part of the paper, we present an extension of the 
action programming language and system Flux which allows to solve planning 
problems with noisy actions, using constraints on real-valued intervals to encode
noise. Prior to presenting these results, we give a brief introduction to the basic 
Fluent Calculus and Flux. 

2 FLUX 

The action programming language Flux [16] is a recent implementation of the 
Fluent Calculus using constraint logic programming. The Fluent Calculus com- 
bines, in classical logic, elements of the Situation Calculus [9] with a STRIPS-like 
solution to the Frame Problem [15]. The standard sorts ACTION and SIT (i.e.,
situations) are inherited from the Situation Calculus along with the standard
functions $S_0 : \mathrm{SIT}$ and $\mathit{Do} : \mathrm{ACTION} \times \mathrm{SIT} \mapsto \mathrm{SIT}$ denoting, resp., the initial
situation and the successor situation after performing an action; furthermore, the
standard predicate $\mathit{Poss} : \mathrm{ACTION} \times \mathrm{SIT}$ denotes whether an action is possible
in a situation. To this the Fluent Calculus adds the sort STATE with sub-sort
FLUENT. The Fluent Calculus also uses the pre-defined functions $\emptyset : \mathrm{STATE}$;
$\circ : \mathrm{STATE} \times \mathrm{STATE} \mapsto \mathrm{STATE}$; and $\mathit{State} : \mathrm{SIT} \mapsto \mathrm{STATE}$; denoting, resp., the
empty state, the union of two states, and the state of the world in a situation.
As an example, let the function $\mathit{Dist} : \mathbb{R} \mapsto \mathrm{FLUENT}$ denote the current
distance of a robot to a wall. If $z$ is a variable of sort STATE, then the following
incomplete state specification says that initially the robot is somewhere between
4.8m and 5.1m away from the wall:¹

$$(\exists x, z)\,\bigl(\mathit{State}(S_0) = \mathit{Dist}(x) \circ z \;\wedge\; 4.8\mathrm{m} \le x \le 5.1\mathrm{m}\bigr) \tag{1}$$

That is, the state in the initial situation is composed of the fluent $\mathit{Dist}(x)$ and
sub-state $z$ representing arbitrary other fluents that may also hold.

Based on the general signature, the Fluent Calculus provides a rigorously 
logical account of the concept of a state being characterized by the set of fluents 
that are true in the state. This is achieved by a suitable subset of the Zermelo-Fraenkel
axioms, stipulating that function $\circ$ behaves like set union with $\emptyset$ as
the empty set (for details see, e.g., [18]). Furthermore, the macro $\mathit{Holds}$ is used
to specify that a fluent is contained in a state:

$$\mathit{Holds}(f, z) \stackrel{\mathrm{def}}{=} (\exists z')\; z = f \circ z' \tag{2}$$

¹ Free variables in formulas are assumed universally quantified. Variables of sorts
ACTION, SIT, FLUENT, and STATE shall be denoted by the letters $a$, $s$, $f$, and
$z$, resp. The function $\circ$ is written in infix notation.

A second macro, which reduces to (2), is used for fluents holding in situations:

$$\mathit{Holds}(f, s) \stackrel{\mathrm{def}}{=} \mathit{Holds}(f, \mathit{State}(s))$$

As an example, consider the following so-called state constraint, which stipulates 
that the distance to the wall be unique in every situation: 

$$(\forall s)\,(\exists! x)\; \mathit{Holds}(\mathit{Dist}(x), s)$$

The Frame Problem is solved in the Fluent Calculus using so-called state 
update axioms, which specify the difference between the states before and after 
an action. The axiomatic characterization of negative effects, i.e., facts that 
become false, is given by an inductive abbreviation which generalizes STRIPS- 
style update to incomplete states: 

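$$z' = z - f \;\stackrel{\mathrm{def}}{=}\; [\,z' \circ f = z \;\vee\; z' = z\,] \wedge \neg\mathit{Holds}(f, z')$$

$$z' = z - (f_1 \circ \ldots \circ f_n \circ f_{n+1}) \;\stackrel{\mathrm{def}}{=}\; (\exists z'')\,\bigl(z'' = z - (f_1 \circ \ldots \circ f_n) \wedge z' = z'' - f_{n+1}\bigr)$$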

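This is the general form of a state update axiom for a (possibly nondeterministic)
action $A(\vec{x})$ with a bounded number of (possibly conditional) effects:

$$\begin{aligned}
\mathit{Poss}(A(\vec{x}), s) \supset{} & (\exists \vec{y}_1)\,\bigl(\Delta_1(\vec{x}, \vec{y}_1, \mathit{State}(s)) \wedge \mathit{State}(\mathit{Do}(A(\vec{x}), s)) = (\mathit{State}(s) - \vartheta_1^-) \circ \vartheta_1^+\bigr) \\
& \vee \ldots \vee \\
& (\exists \vec{y}_n)\,\bigl(\Delta_n(\vec{x}, \vec{y}_n, \mathit{State}(s)) \wedge \mathit{State}(\mathit{Do}(A(\vec{x}), s)) = (\mathit{State}(s) - \vartheta_n^-) \circ \vartheta_n^+\bigr)
\end{aligned}$$

where the sub-formulas $\Delta_i(\vec{x}, \vec{y}_i, \mathit{State}(s))$ specify the conditions on $\mathit{State}(s)$
under which $A(\vec{x})$ has the positive and negative effects $\vartheta_i^+$ and $\vartheta_i^-$, resp. Both
$\vartheta_i^+$ and $\vartheta_i^-$ are state terms composed of fluents with variables among $\vec{x}, \vec{y}_i$.
If $n = 1$ and $\Delta_1 = \mathit{True}$, then action $A(\vec{x})$ does not have conditional effects.
If $n > 1$ and the conditions $\Delta_i$ are not mutually exclusive, then the action is
nondeterministic.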

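Consider, as an example, the function $\mathit{MoveFwd} : \mathbb{R}^{+} \mapsto \mathrm{ACTION}$ denoting
the action of the robot moving a certain (positive) distance towards the wall.
Under the assumption that the effectors are ideal, the effect of this action can
be axiomatized by the following state update axiom:

$$\begin{aligned}
\mathit{Poss}(\mathit{MoveFwd}(d), s) \supset{} & (\exists x, y)\,\bigl(\mathit{Holds}(\mathit{Dist}(x), s) \wedge y = x - d \;\wedge \\
& \mathit{State}(\mathit{Do}(\mathit{MoveFwd}(d), s)) = (\mathit{State}(s) - \mathit{Dist}(x)) \circ \mathit{Dist}(y)\bigr)
\end{aligned} \tag{3}$$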

Put in words, moving the distance $d$ towards the wall has the effect that the
robot is no longer $x$ units away from the wall and will end up at $x - d$.
Recall, for example, formula (1) and suppose for the sake of argument that
$\mathit{Poss}(\mathit{MoveFwd}(2\mathrm{m}), S_0)$. After combining the inequations, our state update axiom
and the foundational axioms imply

$$(\exists y, z)\,\bigl(\mathit{State}(\mathit{Do}(\mathit{MoveFwd}(2\mathrm{m}), S_0)) = z \circ \mathit{Dist}(y) \;\wedge\; 2.8\mathrm{m} \le y \le 3.1\mathrm{m}\bigr) \tag{4}$$

A crucial property of this new state equation is that sub-state $z$ has been carried
over from (1). Thus any additional constraint on $z$ in (1) equally applies to the
successor state $\mathit{State}(\mathit{Do}(\mathit{MoveFwd}(2\mathrm{m}), S_0))$. This is how the Frame Problem
is solved in the Fluent Calculus. 
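
The step from (1) to (4) can be mimicked numerically. The Python sketch below is an illustration only and is not Flux: it assumes a state is a dictionary in which the entry `"dist"` holds a closed interval and every other entry plays the role of the sub-state z; the `Interval` class, the key names, and the extra fluent `"holding"` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi] of possible values (in metres)."""
    lo: float
    hi: float

    def __sub__(self, d):
        # Shifting by an exact distance d moves both bounds by d.
        return Interval(self.lo - d, self.hi - d)


def move_fwd_ideal(state, d):
    """Replace Dist(x) by Dist(x - d); all other fluents are carried over."""
    new_state = dict(state)                 # the sub-state z is copied unchanged
    new_state["dist"] = state["dist"] - d
    return new_state


s0 = {"dist": Interval(4.8, 5.1), "holding": "nothing"}
s1 = move_fwd_ideal(s0, 2.0)
```

Only the entry for the distance changes; whatever else was known about the initial state remains valid in the successor state, which is the point of the argument about sub-state z.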

Based on the theory of the Fluent Calculus, the distinguishing feature of the 
action programming language Flux is to support incomplete states, which are 
modeled by open lists of the form 



    Z0 = [F1,...,Fm | Z]

(encoding the state description $\mathtt{Z0} = \mathtt{F1} \circ \ldots \circ \mathtt{Fm} \circ \mathtt{Z}$), along with constraints

    not_holds(F, Z)
    not_holds_all([X1,...,Xk], F, Z)

encoding, resp., the negative statements $(\exists \vec{y})\,\neg\mathit{Holds}(\mathtt{F}, \mathtt{Z})$ (where $\vec{y}$ are the
variables occurring in F) and $(\exists \vec{y})(\forall \mathtt{X1}, \ldots, \mathtt{Xk})\,\neg\mathit{Holds}(\mathtt{F}, \mathtt{Z})$ (where $\vec{y}$ are the
variables occurring in F except X1,...,Xk). These two constraints are used to
bypass the problem of ‘negation-as-failure’ for incomplete states. In order to process
these constraints, so-called declarative Constraint Handling Rules [4] have
been defined and proved correct under the foundational axioms of the Fluent
Calculus. In addition, the core of Flux contains definitions for holds(F,Z),
by which is encoded macro (2), and update(Z1,ThetaP,ThetaN,Z2), which
encodes the state equation $\mathtt{Z2} = (\mathtt{Z1} - \mathtt{ThetaN}) \circ \mathtt{ThetaP}$.

As an example, the following is the Flux encoding of our state update
axiom (3) (ignoring preconditions) and the initial specification (1):²

    state_update(Z1, move_forward(D), Z2) :-
        holds(dist(X), Z1), Y = X - D,
        update(Z1, [dist(Y)], [dist(X)], Z2).

    init(Z0) :- X :: 4.8..5.1, holds(dist(X), Z0),
                duplicate_free(Z0).

where the constraint duplicate_free(Z) means that list Z does not contain
multiple occurrences. The following sample query computes the conclusion made
in (4):

    [eclipse 1]: init(Z0), state_update(Z0, move_forward(2), Z1).

    Z0 = [dist(X{4.8..5.1}) | _Z]
    Z1 = [dist(Y{2.8..3.1}) | _Z]

² Throughout the paper we use ECLIPSE-Prolog notation. The interval expression
X :: L..R is taken from the library RIA, the constraint solver for interval
arithmetic (see Section 4 below).

3 Update for Noisy Actions

The Fluent Calculus provides a simple and elegant means to axiomatize noisy 
effectors. Uncertainty regarding the values of affected fluents can be represented 
in a state update axiom as existentially quantified and constrained variables. 
For example, suppose that the effectors of our robot are noisy in that the actual 
position after moving towards the wall may differ from the ideal one by the 
factor 0.05. The following is a suitable state update axiom for this noisy action: 

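$$\begin{aligned}
\mathit{Poss}(\mathit{MoveFwd}(d), s) \supset{} & (\exists x, y)\,\bigl(\mathit{Holds}(\mathit{Dist}(x), s) \;\wedge \\
& \mathit{State}(\mathit{Do}(\mathit{MoveFwd}(d), s)) = (\mathit{State}(s) - \mathit{Dist}(x)) \circ \mathit{Dist}(y) \;\wedge \\
& |y - (x - d)| \le d \cdot 0.05\bigr)
\end{aligned} \tag{5}$$

Moving a distance $d$ thus has the effect that the robot will end up at some
distance $y$ which is at most $d \cdot 0.05$ units away from the goal position $x - d$.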
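
When the prior distance is only known to lie in an interval, the noisy update widens that interval on both sides. The following Python sketch illustrates this reading of the noisy state update axiom; the function name `move_fwd_noisy`, the tuple representation of intervals, and the default parameter are assumptions for illustration, not part of Flux.

```python
def move_fwd_noisy(dist, d, noise=0.05):
    """Bounds on the distance after a noisy move of d towards the wall.

    dist is a pair (lo, hi) bounding the distance before the move. The
    result bounds every y with |y - (x - d)| <= d * noise for some x in
    [lo, hi].
    """
    lo, hi = dist
    slack = d * noise                      # maximal deviation of the effector
    return (lo - d - slack, hi - d + slack)


after = move_fwd_noisy((4.8, 5.1), 2.0)    # approximately (2.7, 3.2)
```

With ideal effectors (`noise=0`) the same function reproduces the interval 2.8 to 3.1 obtained for the noise-free axiom.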

To represent knowledge in the Fluent Calculus and to reason about sensing 
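actions, the predicate $\mathit{KState} : \mathrm{SIT} \times \mathrm{STATE}$ has been introduced in [17]. An
instance $\mathit{KState}(s, z)$ means that according to the knowledge of the planning
robot, $z$ is a possible state in situation $s$. On this basis, the fact that some
property of a situation is known to the robot is specified using the macro $\mathit{Knows}$,
which is defined as follows:

$$\mathit{Knows}(\varphi, s) \stackrel{\mathrm{def}}{=} (\forall z)\,\bigl(\mathit{KState}(s, z) \supset \mathit{HOLDS}(\varphi, z)\bigr) \tag{6}$$

where

$$\begin{aligned}
\mathit{HOLDS}(\alpha, z) &\stackrel{\mathrm{def}}{=} \alpha \quad (\alpha \text{ arithmetic constraint}) \\
\mathit{HOLDS}(f, z) &\stackrel{\mathrm{def}}{=} \mathit{Holds}(f, z) \\
\mathit{HOLDS}(\neg\varphi, z) &\stackrel{\mathrm{def}}{=} \neg\mathit{HOLDS}(\varphi, z) \\
\mathit{HOLDS}(\varphi \wedge \psi, z) &\stackrel{\mathrm{def}}{=} \mathit{HOLDS}(\varphi, z) \wedge \mathit{HOLDS}(\psi, z) \\
\mathit{HOLDS}((\forall x)\,\varphi, z) &\stackrel{\mathrm{def}}{=} (\forall x)\,\mathit{HOLDS}(\varphi, z)
\end{aligned}$$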

This model of knowledge uses pure first-order logic. As an example, the precondition
for the action $\mathit{MoveFwd}$ can be specified in such a way that the robot
always keeps a safety distance to the wall. This of course requires taking into
account the uncertainty of the effectors:

$$\mathit{Poss}(\mathit{MoveFwd}(d), s) \equiv \mathit{Knows}\bigl((\forall x)\,(\mathit{Dist}(x) \supset x - 1.05 \cdot d \ge 0.1\mathrm{m}),\, s\bigr)$$

Hence, moving forward is possible only if the robot knows that it will end up
at least 0.1m away from the wall. Suppose that the robot knows that its
initial position is somewhere between 4.8m and 5.1m away from the wall, that
is,

$$\mathit{KState}(S_0, z) \supset (\forall x)\,\bigl(\mathit{Holds}(\mathit{Dist}(x), z) \supset 4.8\mathrm{m} \le x \le 5.1\mathrm{m}\bigr) \tag{7}$$

With the help of macro (6) it follows that $\mathit{Poss}(\mathit{MoveFwd}(2\mathrm{m}), S_0)$.
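
Under interval knowledge, the precondition amounts to checking the worst case, i.e. the smallest distance the robot considers possible. The following Python sketch illustrates that check; the function name `poss_move_fwd` and its parameters are hypothetical, and the representation of knowledge as a single interval is an assumption made for illustration.

```python
def poss_move_fwd(known_dist, d, noise=0.05, safety=0.1):
    """True if the robot knows it stays at least `safety` from the wall.

    known_dist is a pair (lo, hi) bounding every distance the robot
    considers possible. The condition x - (1 + noise) * d >= safety must
    hold for all such x, so it suffices to test the lower bound lo.
    """
    lo, _hi = known_dist
    return lo - (1 + noise) * d >= safety
```

For the initial knowledge between 4.8m and 5.1m, a move of 2m passes the check (4.8 - 2.1 = 2.7), whereas a move of 4.5m does not (4.8 - 4.725 = 0.075).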

The Frame Problem for knowledge is solved by axioms that determine the 
relation between the possible states before and after an action. More formally,
the effect of an action $A(\vec{x})$, be it sensing or not, on the knowledge is specified
by a knowledge update axiom of the form

$$\mathit{Poss}(A(\vec{x}), s) \supset \bigl[(\forall z)\,\bigl(\mathit{KState}(\mathit{Do}(A(\vec{x}), s), z) \equiv (\exists z')\,(\mathit{KState}(s, z') \wedge \Psi(z, z', s))\bigr)\bigr] \tag{8}$$

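In case of non-sensing actions, formula $\Psi$ defines what the robot knows of the
effects of the action. E.g., state update axiom (5) for moving with unreliable
effectors corresponds to the following knowledge update axiom:

$$\begin{aligned}
\mathit{Poss}(\mathit{MoveFwd}(d), s) \supset \bigl[(\forall z)\,\bigl(&\mathit{KState}(\mathit{Do}(\mathit{MoveFwd}(d), s), z) \equiv \\
& (\exists x, y, z')\,(\mathit{KState}(s, z') \wedge \mathit{Holds}(\mathit{Dist}(x), z') \;\wedge \\
& \quad z = (z' - \mathit{Dist}(x)) \circ \mathit{Dist}(y) \;\wedge \\
& \quad |y - (x - d)| \le d \cdot 0.05)\bigr)\bigr]
\end{aligned} \tag{9}$$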

The generic action term $\mathit{Sense}(f)$ has been introduced in [17] to denote
the action of sensing whether a fluent $f$ holds. The corresponding knowledge
update axiom,

$$\begin{aligned}
\mathit{Poss}(\mathit{Sense}(f), s) \supset \bigl[&\mathit{KState}(\mathit{Do}(\mathit{Sense}(f), s), z) \equiv \\
& \mathit{KState}(s, z) \wedge [\mathit{Holds}(f, z) \equiv \mathit{Holds}(f, s)]\bigr]
\end{aligned} \tag{10}$$

says that among the states possible in $s$ only those are still possible after sensing
which agree with the actual state of the world as far as the sensed fluent is
concerned. An important implication is that after sensing $f$ either the fluent or
its negation is known to hold [17]. Thus axiom (10) can be viewed as modeling
an ideal sensor.

In order to model sensing with noise, axiom schema (8) needs to be used in a
different manner, where formula $\Psi$ restricts the possible states to those where
the value of the sensed property may deviate from the actual value within a
certain range. To this end, we introduce the generic ACTION function $\mathit{Sense}_F$
where $F$ can be any domain function of type $\mathbb{R} \mapsto \mathrm{FLUENT}$.³ For later purpose,
we assume that for any fluent which can thus be sensed there is an additional
fluent $\mathit{SensorReading}_F(x)$ denoting the last sensor reading. Let $\varrho_F$ denote the
maximal deviation of the noisy sensor reading from the actual value. The effect
of noisy $\mathit{Sense}_F$ is then specified by this knowledge update axiom:

$$\begin{aligned}
\mathit{Poss}(\mathit{Sense}_F, s) \supset \bigl[(\exists r)(\forall z)\,\bigl(&\mathit{KState}(\mathit{Do}(\mathit{Sense}_F, s), z) \equiv \\
& (\exists r', x, y, z')\,(\mathit{KState}(s, z') \;\wedge \\
& \quad \mathit{Holds}(F(x), s) \;\wedge \\
& \quad \mathit{Holds}(F(y), z') \wedge \mathit{Holds}(\mathit{SensorReading}_F(r'), z') \;\wedge \\
& \quad z = (z' - \mathit{SensorReading}_F(r')) \circ \mathit{SensorReading}_F(r) \;\wedge \\
& \quad |x - r| \le \varrho_F \wedge |y - r| \le \varrho_F)\bigr)\bigr]
\end{aligned} \tag{11}$$

Footnote: For the sake of simplicity, we assume that each sensor delivers just a unary value. The
generalization to perceivable fluents with multiple arguments, such as the position 
in a two-dimensional space, is straightforward. 
Put in words, after sensing, a sensor reading $r$ is obtained which differs from
the actual value $x$ by at most $\varrho_F$, and only those states are still considered
possible where the value $y$ of the sensed fluent deviates from $r$ by at most $\varrho_F$
and where the old sensor reading $r'$ has been updated to $r$. The accompanying
state update axiom says that a new sensor reading is obtained within the allowed
range:



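$$
\begin{aligned}
& \mathit{Poss}(\mathit{Sense}_F, s) \supset {} \\
& \quad (\exists r, r', x)\,(\mathit{Holds}(F(x), s) \wedge \mathit{Holds}(\mathit{SensorReading}_F(r'), s) \wedge {} \\
& \qquad \mathit{State}(\mathit{Do}(\mathit{Sense}_F, s)) = (\mathit{State}(s) - \mathit{SensorReading}_F(r')) \circ \mathit{SensorReading}_F(r) \wedge {} \\
& \qquad |x - r| \le \varrho_F)
\end{aligned}
\tag{12}
$$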



While precise knowledge of the sensed property is thus no longer guaranteed,
an important consequence of the generic knowledge update axiom (11) is that
sensing never causes loss of possibly more precise initial knowledge wrt. the sensed
fluent.

Proposition 1. Let $\alpha$ be an arithmetic constraint with free variable $x$, then
(11) and the foundational axioms entail

$$
\mathit{Knows}((\exists x)\,(F(x) \wedge \alpha), s) \supset \mathit{Knows}((\exists x)\,(F(x) \wedge \alpha), \mathit{Do}(\mathit{Sense}_F, s))
$$



A second crucial consequence is that the sensor reading itself will be known by 
the robot. 

Proposition 2. (11) and the foundational axioms entail

$$
(\exists x)\,\mathit{Knows}(\mathit{SensorReading}_F(x), \mathit{Do}(\mathit{Sense}_F, s))
$$
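Both propositions can be illustrated numerically. The following is a minimal sketch, assuming that the robot's knowledge about the sensed value is a single closed interval; the function name `noisy_sense` and the sample numbers are illustrative only.

```python
def noisy_sense(prior, actual, reading, rho):
    """Narrow an interval of possible values after a noisy sensor reading.

    prior   -- (lo, hi): values considered possible before sensing
    actual  -- the real value x (unknown to the robot)
    reading -- the obtained reading r, which must satisfy |x - r| <= rho
    Returns the new interval and the reading, which is now known.
    """
    lo, hi = prior
    if not lo <= actual <= hi:
        raise ValueError("actual value must be one of the possible values")
    if abs(actual - reading) > rho:
        raise ValueError("reading deviates from the actual value by more than rho")
    # Only values y with |y - r| <= rho remain possible.
    posterior = (max(lo, reading - rho), min(hi, reading + rho))
    return posterior, reading


prior = (2.7, 3.2)
posterior, known_reading = noisy_sense(prior, actual=3.0, reading=3.08, rho=0.1)
```

The posterior interval is always contained in the prior one, so a constraint that held for every possible value before sensing still holds afterwards (Proposition 1), and the reading itself is a definite value (Proposition 2).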



Suppose, for example, that the noise of the robot’s sensor for measuring the
distance to the wall is given by $\varrho_{\mathit{Dist}} = 0.1$m, and recall the specification of initial
knowledge given in (7). Then after moving 2m towards the wall and sensing the
distance, the robot will know the distance to the wall within a range of 0.1m.
Moreover, this distance must be between 2.7m and 3.2m (as above) while the
possible sensor readings are between 2.6m and 3.3m due to the noise of the
sensor. Formally, let $S_2 = \mathit{Do}(\mathit{Sense}_{\mathit{Dist}}, \mathit{Do}(\mathit{MoveFwd}(2\text{m}), S_0))$, then the state
and knowledge update axioms for $\mathit{MoveFwd}$ and $\mathit{Sense}_{\mathit{Dist}}$ entail

$$
\begin{aligned}
& (\exists x)\,(2.7\text{m} \le x \le 3.2\text{m} \wedge \mathit{Knows}((\exists y)\,(\mathit{Dist}(y) \wedge |y - x| \le 0.1\text{m}), S_2)) \\
& \wedge\ (\exists r)\,(2.6\text{m} \le r \le 3.3\text{m} \wedge \mathit{Knows}(\mathit{SensorReading}_{\mathit{Dist}}(r), S_2))
\end{aligned}
$$



Solutions to planning problems with noisy sensing actions may require
conditioning an action on the outcome of sensing. Suppose, for example, the goal
of our robot is to be at a point where it is between 2.8m and 3.2m away from
the wall. Given the initial knowledge state (7), this problem is solvable only by
a plan which includes reference to previous sensor readings. A possible solution
along this line would be for the robot to advance 2m, measure the distance, and 
if necessary adjust its position according to the obtained reading r. A suitable 
choice of the argument for the second $\mathit{MoveFwd}$ action is $r - 3$m: As we have
seen above, the robot knows that it will be between 2.7m and 3.2m away from
the wall after the initial move. Moreover, the sensor reading $r$ will measure
the actual position $x$ within the range $x \pm 0.1$m. Thus, even with the added
uncertainty caused by the new movement, the robot can be sure without further
sensing that it will end up at a distance which is between 2.885m and 3.12m.
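The bounds 2.885m and 3.12m can be checked with a short worst-case computation. The sketch below makes two assumptions that are not spelled out in the text: the effector error of the second move is taken as 5% of the absolute value of its argument (so that backward moves are covered as well), and the extremes are found by sampling a grid that contains the interval end points. Exact rational arithmetic avoids rounding effects.

```python
from fractions import Fraction as Fr

RHO = Fr(1, 10)       # sensor noise bound of 0.1 m
EFFECTOR = Fr(1, 20)  # 5% effector error, as in axiom (9)


def final_distance_bounds(x_lo, x_hi, target, steps=50):
    """Worst-case distance after sensing r and then moving forward by r - target."""
    lo = hi = None
    for i in range(steps + 1):
        x = x_lo + (x_hi - x_lo) * i / steps   # actual distance after the first move
        for j in range(-10, 11):
            r = x + RHO * j / 10               # a reading with |x - r| <= RHO
            move = r - target                  # argument of the second MoveFwd
            err = EFFECTOR * abs(move)         # assumption: error scales with |move|
            for y in (x - move - err, x - move + err):
                lo = y if lo is None else min(lo, y)
                hi = y if hi is None else max(hi, y)
    return lo, hi


lo, hi = final_distance_bounds(Fr("2.7"), Fr("3.2"), target=3)
print(float(lo), float(hi))
```

Under these assumptions the computed range lies inside the goal interval of 2.8m to 3.2m, which is why no second sensing action is needed.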

In order to allow for the use of sensor readings as parameters for actions,
we need to make precise when formal actions such as $\mathit{MoveFwd}(r - 3\text{m})$ can
be considered executable by the robot. To this end, we introduce the macro
$\mathit{Kref}(\tau, s)$ (inspired by [14]) with the intended meaning that the arithmetic
expression $\tau$ can be evaluated by the robot on the basis of its knowledge in
situation $s$. The macro is inductively defined as follows:

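$$
\begin{aligned}
\mathit{Kref}(c, s) &\stackrel{\text{def}}{=} \mathit{True} && (c \text{ constant}) \\
\mathit{Kref}(F, s) &\stackrel{\text{def}}{=} (\exists x)\,\mathit{Knows}(F(x), s) && (F \text{ value fluent}) \\
\mathit{Kref}(op(\tau_1, \tau_2), s) &\stackrel{\text{def}}{=} \mathit{Kref}(\tau_1, s) \wedge \mathit{Kref}(\tau_2, s) && (op \in \{+, -, \cdot, \ldots\})
\end{aligned}
\tag{13}
$$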

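The executability of a plan $\mathit{Do}(a_n, \ldots, \mathit{Do}(a_1, S_0) \ldots)$ is then defined as the
macro EXEC as follows:

$$
\begin{aligned}
\mathit{EXEC}(S_0) &\stackrel{\text{def}}{=} \mathit{True} \\
\mathit{EXEC}(\mathit{Do}(A(\tau_1, \ldots, \tau_k), s)) &\stackrel{\text{def}}{=} \mathit{EXEC}(s) \wedge \mathit{Poss}(A(\tau_1, \ldots, \tau_k), s) \\
&\qquad \wedge\ \mathit{Kref}(\tau_1, s) \wedge \ldots \wedge \mathit{Kref}(\tau_k, s)
\end{aligned}
$$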
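The inductive structure of the two macros can be mirrored in a few lines. This is only a sketch under simplifying assumptions: terms are numbers, fluent names (strings) or triples `(op, left, right)`; "knowing the value of a fluent" is reduced to membership in a set of names; and the truth of Poss is passed in as a Boolean rather than derived from axioms.

```python
def kref(term, known_fluents):
    """Can the arithmetic term be evaluated from what the robot knows?"""
    if isinstance(term, (int, float)):
        return True                      # constants are always evaluable
    if isinstance(term, str):
        return term in known_fluents     # value fluent: its value must be known
    _op, left, right = term              # compound term op(t1, t2)
    return kref(left, known_fluents) and kref(right, known_fluents)


def executable(steps):
    """steps: one (poss, args, known_fluents) triple per action, in execution order."""
    return all(
        poss and all(kref(arg, known) for arg in args)
        for poss, args, known in steps
    )


second_move = ("-", "SensorReading", 3)   # the argument r - 3m from the example
plan = [
    (True, [2], set()),                          # MoveFwd(2m)
    (True, [], set()),                           # Sense_Dist
    (True, [second_move], {"SensorReading"}),    # MoveFwd(r - 3m)
]
```

In this encoding `("-", "Dist", 3)` is not evaluable as long as only the sensor reading, and not the exact distance, is known.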

4 Planning with Noise in Flux 

Encoding state update axioms for noisy actions requires stating arithmetic
constraints. A constraint solver is then needed to deal with these constraints. A
suitable choice for Flux is the standard ECLiPSe constraint system RIA (for real
number interval arithmetic). Incorporating this constraint module, the following
is a suitable encoding of state update axiom (5), specifying an action with
unreliable effectors:

    :- lib(ria).

    state_update(Z1, move_forward(D), Z2) :-
        holds(dist(X), Z1),
        abs(Y-(X-D)) *=< 0.05*D,
        update(Z1, [dist(Y)], [dist(X)], Z2).

In comparison with the computed answer shown at the end of Section 2, the 
new state update axiom causes a higher degree of uncertainty wrt. the resulting 
position: 

    init(Z0) :- X :: 4.8..5.1, holds(dist(X), Z0),
                duplicate_free(Z0).

    [eclipse 1]: init(Z0), state_update(Z0, move_forward(2), Z1).

    Z0 = [dist(X{4.8..5.1}) | _Z]

    Z1 = [dist(Y{2.7..3.2}) | _Z]
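The interval in the computed answer can be reproduced by hand. As a rough cross-check (not a substitute for the constraint solver), the following sketch propagates a closed interval through the constraint |Y - (X - D)| <= 0.05*D; the function name mirrors the Flux action but is otherwise an illustration.

```python
from fractions import Fraction as Fr


def move_forward(dist, d, rel_error=Fr(1, 20)):
    """Propagate an interval (lo, hi) for the distance through a noisy move by d."""
    lo, hi = dist
    err = rel_error * d
    # Y may deviate from X - D by at most err in either direction.
    return (lo - d - err, hi - d + err)


z0 = (Fr("4.8"), Fr("5.1"))
z1 = move_forward(z0, 2)
```

With the initial interval 4.8..5.1 and D = 2, the error bound is 0.1, giving 2.7..3.2 as in the answer above.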

While the explicit notion of possible states leads to an extensive framework 
for reasoning about knowledge and noisy sensing, automated deduction becomes 
considerably more intricate by the introduction of the modality-like KState 
predicate. As a consequence, in [18] we have developed an inference method 
which avoids separate update of knowledge and states. In what follows, we ex- 
tend this result to noisy sensors and effectors and show how knowledge updates 
are implicitly obtained by progressing an incomplete state through state update 
axioms. 

Our approach rests on two assumptions. First, the planning robot needs to
know the given initial specification $\Phi(\mathit{State}(S_0))$, and this is all it knows of $S_0$,
that is, $\mathit{KState}(S_0, z) \equiv \Phi(z)$. Second, the robot must have accurate knowledge
of its own actions. That is, formally, the possible states after a non-sensing action
are those which would be the result of actually performing the action in one of
the previously possible states:

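Definition 1. [17] A set of axioms $\Sigma$ represents accurate effect knowledge if
for each non-sensing action function $A$, $\Sigma$ contains a unique state update
axiom

$$
\mathit{Poss}(A(\vec{x}), s) \supset \Psi_A\{z / \mathit{State}(\mathit{Do}(A(\vec{x}), s)),\ z' / \mathit{State}(s)\}
\tag{14}
$$

(where $\Psi_A(\vec{x}, z, z')$ is a first-order formula with free variables among $\vec{x}, z, z'$ and
without a sub-term of sort SIT) and a unique knowledge update axiom which is
equivalent to

$$
\begin{aligned}
& \mathit{Poss}(A(\vec{x}), s) \supset [\,(\forall z)\,(\mathit{KState}(\mathit{Do}(A(\vec{x}), s), z) \equiv {} \\
& \qquad (\exists z')\,(\mathit{KState}(s, z') \wedge \Psi_A(\vec{x}, z, z')))\,]
\end{aligned}
\tag{15}
$$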



Accurate knowledge of effects suffices to ensure that the possible states after
a non-sensing action can be obtained by progressing a given state specification
through the state update axiom for that action. The effect of sensing, on the
other hand, cannot be obtained in the same fashion. To see why, let $S$ be a
situation and consider the knowledge specification



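$$
\begin{aligned}
& \mathit{KState}(S, z) \equiv {} \\
& \quad (\exists x, y)\,(z = \mathit{Dist}(x) \circ \mathit{SensorReading}_{\mathit{Dist}}(y) \wedge 4.8\text{m} \le x \le 5.1\text{m})
\end{aligned}
\tag{16}
$$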



Suppose that $\mathit{Poss}(\mathit{Sense}_{\mathit{Dist}}, S)$, then knowledge update axiom (11) yields
different models reflecting the possible sensing result $r$, which run from 4.7m to
5.2m. However, in each model the distance is known up to an error of just
$\varrho_{\mathit{Dist}} = 0.1$m! Hence, while we cannot predict the sensing outcome, it is clear that the
sensed value will be known within the precision of the sensor. This knowledge is
not expressible by a specification of the form $\mathit{KState}(\mathit{Do}(\mathit{Sense}_{\mathit{Dist}}, S), z) \equiv \Phi(z)$
entailed by (16) and (11). Hence, the effect of a sensing action cannot be obtained
by straightforward progression.
In order to account for different models for $\mathit{KState}$ caused by sensing, we
introduce the notion of a sensing history $\varsigma$ as a finite, possibly empty list of
real numbers. A history is meant to describe the outcome of each sensing action
in a sequence of actions. For the sake of simplicity, we assume that the only
sensing action is the generic $\mathit{Sense}_F$ with knowledge update axiom (11) and
state update axiom (12).

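For the formal definition of progression we also need the notion of an action
sequence $\sigma$ as a finite, possibly empty list of ground action terms. An action
sequence corresponds naturally to a situation, which we denote by $S_\sigma$:

$$
S_{[\,]} \stackrel{\text{def}}{=} S_0 \quad\text{and}\quad S_{[A(\vec{t})\,|\,\sigma]} \stackrel{\text{def}}{=} \mathit{Do}(A(\vec{t}), S_\sigma)
$$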

We are now in a position to define, inductively, a progression operator $\mathcal{P}(\sigma, \varsigma, z)$
along the line of [18], by which an initial state specification $\Phi(\mathit{State}(S_0))$ is
progressed through an action sequence $\sigma$ wrt. a sensing history $\varsigma$, resulting in
a formula specifying $z$:



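$$
\mathcal{P}([\,], \varsigma, z) \stackrel{\text{def}}{=} \Phi(z) \quad \text{if } \varsigma = [\,]
\tag{17}
$$

$$
\mathcal{P}([A(\vec{t})\,|\,\sigma], \varsigma, z) \stackrel{\text{def}}{=} (\exists z')\,(\mathcal{P}(\sigma, \varsigma, z') \wedge \Psi_A(\vec{t}, z, z'))
\quad \text{if } A \text{ non-sensing with state update (14)}
\tag{18}
$$

$$
\begin{aligned}
& \mathcal{P}([\mathit{Sense}_F\,|\,\sigma], \varsigma, z) \stackrel{\text{def}}{=} {} \\
& \quad (\exists r', x, y, z')\,(\mathcal{P}(\sigma, \varsigma', z') \wedge {} \\
& \qquad \mathit{Holds}(F(x), s) \wedge {} \\
& \qquad \mathit{Holds}(F(y), z') \wedge \mathit{Holds}(\mathit{SensorReading}_F(r'), z') \wedge {} \\
& \qquad z = (z' - \mathit{SensorReading}_F(r')) \circ \mathit{SensorReading}_F(r) \wedge {} \\
& \qquad |x - r| \le \varrho_F \wedge |y - r| \le \varrho_F) \\
& \quad \text{where } \varsigma = [r\,|\,\varsigma']
\end{aligned}
\tag{19}
$$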

In case the length of the history $\varsigma$ does not equal the number of sensing actions
in $\sigma$, we define $\mathcal{P}(\sigma, \varsigma, z)$ as False. Progression provides a provably correct
inference method for knowledge update.

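Theorem 1. Consider the initial state and knowledge $\Sigma_0 = \{\Phi(\mathit{State}(S_0)),\ \mathit{KState}(S_0, z) \equiv \Phi(z)\}$
and let $\Sigma$ be the foundational axioms plus a set of domain
axioms representing accurate effect knowledge. Let $\sigma$ be an action sequence such
that $\Sigma \cup \Sigma_0 \models \mathit{EXEC}(S_\sigma)$. Then for any model $\mathcal{M}$ of $\Sigma_0 \cup \Sigma$ and any
valuation $\nu$,

$$
\mathcal{M}, \nu \models \mathit{KState}(S_\sigma, z) \quad\text{iff}\quad \mathcal{M}, \nu \models \mathcal{P}(\sigma, \varsigma, z) \text{ for some } \varsigma
$$

The proof is by simple induction on $\sigma$.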

This theorem serves as the formal justification for the Flux encoding of
knowledge and sensing. The sensing action $\mathit{Sense}_{\mathit{Dist}}$, for example, is encoded
by a state update axiom which carries as additional argument the result of
sensing, that is, the sensor reading:
    state_update(Z1, sense_dist, Z2, SV) :-
        holds(dist(X), Z1), holds(reading(Y), Z1),
        abs(X-SV) *=< 0.1,
        update(Z1, [reading(SV)], [reading(Y)], Z2).

The definition of progression is a direct encoding of (17)-(19): 

    p([], [], Z) :- init(Z).
    p([A|S], H2, Z2) :- p(S, H1, Z1),
        ( state_update(Z1, A, Z2), H2=H1 ;
          state_update(Z1, A, Z2, SV), H2=[SV|H1] ).
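For readers unfamiliar with constraint logic programming, the same idea can be sketched procedurally. The version below is an illustration under simplifying assumptions, not a translation of Flux: the only fluent tracked is the distance, knowledge is a single closed interval, and, unlike in the clauses above where the head of each list is the most recent element, actions and readings are given in execution order.

```python
from fractions import Fraction as Fr

RHO = Fr(1, 10)       # sensor noise bound (0.1, as in sense_dist)
EFFECTOR = Fr(1, 20)  # relative effector error (0.05, as in move_forward)


def progress(dist, actions, history):
    """Progress the interval `dist` through `actions` wrt. a sensing history.

    Returns ((lo, hi), last_reading), or None if the history does not match
    the sensing actions or a reading is inconsistent with the interval.
    """
    readings = list(history)
    lo, hi = dist
    last_reading = None
    for action in actions:
        if action == "sense_dist":
            if not readings:
                return None                      # history too short
            last_reading = readings.pop(0)
            lo, hi = max(lo, last_reading - RHO), min(hi, last_reading + RHO)
            if lo > hi:
                return None                      # reading impossible in this state
        else:
            _name, d = action                    # ("move_forward", d)
            err = EFFECTOR * abs(d)
            lo, hi = lo - d - err, hi - d + err
    if readings:
        return None                              # history too long
    return (lo, hi), last_reading


start = (Fr("4.8"), Fr("5.1"))
plan = [("move_forward", 2), "sense_dist"]
result = progress(start, plan, [Fr(3)])
```

A history whose length differs from the number of sensing actions yields `None`, corresponding to the case where progression is defined as False.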

In principle, the Flux clauses we arrived at can readily be used by a simple
forward-chaining search algorithm. Enumerating the set of plans, including all
possible sensing actions, a solution will eventually be found provided the problem
is solvable. However, planning with incomplete states usually involves a
considerable search space, and the possibility to generate conditional plans only
enlarges it. The concept of nondeterministic robot programs has been
introduced in Golog as a powerful heuristic for planning, where only those plans
are searched which match a given skeleton [10]. This avoids considering
obviously useless actions such as ineffectual sensing. In [18] we have shown how this
concept can be adopted in Flux on the basis of a progression operator, in order
to make planning with sensing more efficient. These heuristics can be directly
applied to planning with noisy actions in Flux.

5 Summary and Discussion 

We have presented an approach to planning with noisy actions by appealing to
the Fluent Calculus as a basic solution to the Frame Problem. The
axiomatization has been shown to exhibit reasonable properties. Moreover, we have extended
the action programming language Flux to obtain a system for solving planning
problems that involve noisy actions.

Both the axiomatic approach and the realization in Flux are an
extension of the solution to the Frame Problem for knowledge [17,18]. A distinguishing
feature of this approach is its expressiveness in comparison to most existing
approaches to planning with knowledge and sensing. Unlike other systems, Flux
is not tailored to restricted classes of planning problems (as opposed to, e.g., [6,
5,2,11,8]) and allows searching for suitable sensing actions during planning (as
opposed to [13]).

Closest to our work is [1], where an extension of the Situation Calculus is
presented that allows noisy actions to be axiomatized. The crucial difference to our
approach is the indirect way of modeling a noisy action as a non-deterministic
selection among actions with determined effects. To this end, the approach uses
the non-deterministic programming constructs of Golog for modeling noise.
Consequently, these programs can no longer be used as planning heuristics, and
therefore the theory cannot be straightforwardly integrated into Golog to
provide a planning system that deals with noise. On the other hand, the approach
of [1] includes a notion of probability distribution for noisy effects. The extension
of our approach along this line is an important goal for future work.
Abstract. In nature, animals going about their daily routine need to
avoid predators in order to survive. Many animals have evolved some 
kind of startle response, which enables them to escape from dangerous 
situations. As robots are required to operate in more hostile environ- 
ments, mechanisms analogous to startle responses may be critical to 
building robust systems. This paper presents some preliminary work ex- 
ploring (1) how some reactive evasive behaviours can be added to an 
agent operating in a hostile environment, and (2) how evasive measures 
can be integrated with the agent’s other activities. 



1 Introduction 

Almost every animal is required to escape from predators from time to time in
order to survive and reproduce. Animats may also need to escape from sources
of danger, such as malevolent passers-by and curious children [18]. The key is
not only to have effective escape mechanisms but also to integrate escape with
the animat’s other activities. An animal that spends all its time running away
without stopping to find and eat food will not survive very long.

A startle response for escaping quickly in critical situations is an adaptation 
which has arisen in many different animals from disparate parts of the evolu- 
tionary tree [7]. The startle responses vary greatly. In common they usually have 
simple reliable triggers, very fast activation and often produce a stereotyped
response [3]. The fact that almost all animals have some kind of startle response
suggests that there is an advantage to having a dedicated subsystem for detecting
and responding to hazardous situations. Hoy [13] argues that startle responses
will have a significant role in designing robust robot architectures.

A relatively ‘simple’ invertebrate animal that is particularly well studied by 
neuroscientists is the crayfish. (This is because they have relatively few neurons, 
some of the neurons are very large and crayfish make a delicious meal at the end 
of the experiment.) The crayfish is equipped with a variety of evasion techniques 
including (but not limited to): spending large amounts of time in hiding; retreat- 
ing to safety if it notices a far away predator; and an escape response [19]. The 
escape response is a last resort mechanism that is used in extreme conditions.
It is activated by a sharp tap to the abdomen or sudden visual stimuli. The
response is a stereotyped tailflip that rapidly propels the crayfish away from the
source of danger. The trigger for the escape reflex is controlled by giant
command neurons in the abdomen. These neurons are responsible for integrating a
large group of stimuli and making a snap decision. For a detailed reference to
the neural organisation of the crayfish escape mechanism refer to [25,24,17,15,16].

Some previous studies that have drawn inspiration from invertebrate neural 
circuits have yielded promising results. Beer [1] successfully modelled the neurons 
controlling insect walking with an R-C network. The circuit was fully distributed, 
efficient and robust; later it was used to control a real hexapod robot. There has
also been work on modelling the escape response of the cockroach [2].

There have been many previous studies of evasion in isolation. Examples of
evolving optimal strategies in scenarios with fixed predator behaviour
include [14] and [12]. Miller and Cliff [18] co-evolved pursuer and evader tactics
using noisy neural network controllers. The pursuer-evader problem has been
reformulated as a one-dimensional, time-series prediction game [10]. There has
been exploration of the evolution of evasion strategies when the game is made
slightly asymmetric between the pursuer and evader [23].

This paper presents some preliminary work in exploring how some simple 
reactive evasive behaviours can be added to an agent operating in a hostile 
environment. The mechanisms used are loosely inspired by the evasive tactics 
and reactions of the crayfish. 

2 The Scenario 

The scenario is a simple predator-prey simulation. There is one predator and 
one prey. The prey has the task of collecting enough food to survive while being 
hunted by a predator. The predator has a greater maximum velocity and a
longer seeing distance than the prey. The prey has superior acceleration over
the predator and may choose to hide in a shelter where it is safe from the 
predator. 

The environment is a continuous two-dimensional plane of $n \times m$ units. It
has wraparound edges (this is to avoid the artifact of the prey being trapped in
a corner). The world contains pieces of food located at random locations. New
pieces of food are added and old ones are removed at random time intervals. 
Also situated in the environment is a shelter. When the prey is in the shelter 
the predator is unable to see it and unable to kill it. 
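On a plane with wraparound edges the shortest path between two entities may cross an edge. The following sketch shows one way to handle positions and distances in such a world; the paper does not say how the simulation measures distance, so the helper names and the shortest-path convention are assumptions.

```python
import math


def wrap(coordinate, size):
    """Map a coordinate back into the range [0, size) after a move."""
    return coordinate % size


def torus_distance(a, b, width, height):
    """Shortest distance between points a and b on a wraparound plane."""
    dx = abs(a[0] - b[0]) % width
    dy = abs(a[1] - b[1]) % height
    dx = min(dx, width - dx)     # going across the edge may be shorter
    dy = min(dy, height - dy)
    return math.hypot(dx, dy)
```

For example, on a 100 by 100 world the points (1, 1) and (99, 1) are 2 units apart rather than 98.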

The predator and prey are able to move within the world and are able to 
make some limited interactions with the other entities in the world. The prey is 
able to eat a piece of food if it is close enough. The predator is able to kill the 
prey if it is close enough. The simulation is updated in discrete time steps. At 
each time step the predator and prey are queried by the simulation engine about 
their intended movements and other actions they want to take. All updates to 
positions and interactions are executed simultaneously. If the predator or prey
elect to change their velocity, they are not able to do it instantly but instead
accelerate to new velocities. It may take several time steps for the predator or
prey to reach their new velocity. The predator and prey each have maximum
rates of acceleration.
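A bounded-acceleration update of this kind might look as follows. This is a sketch only: it assumes that velocities are two-dimensional vectors and that the limit applies to the length of the velocity change per time step, neither of which is stated in the text.

```python
import math


def accelerate(velocity, target, max_acceleration):
    """Move `velocity` towards `target` by at most `max_acceleration` per step."""
    dvx = target[0] - velocity[0]
    dvy = target[1] - velocity[1]
    change = math.hypot(dvx, dvy)
    if change <= max_acceleration:
        return target                       # the new velocity is reached this step
    scale = max_acceleration / change       # clamp the change to the allowed length
    return (velocity[0] + dvx * scale, velocity[1] + dvy * scale)


# Starting from rest, a target velocity of (1, 0) takes four steps at 0.25 per step.
v = (0.0, 0.0)
steps = 0
while v != (1.0, 0.0):
    v = accelerate(v, (1.0, 0.0), 0.25)
    steps += 1
```

A larger `max_acceleration` for the prey than for the predator would model the prey's superior acceleration.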

The predator follows a very simple behaviour pattern. This behaviour pat- 
tern is fixed for all the experiments. The predator roams around at half-speed 
travelling in a straight line. At random time intervals it changes to a new random 
direction. The predator is continuously looking for the prey. If at any point the 
predator spots the prey it will immediately change its direction to head directly 
toward it and accelerate to its maximum velocity. When the predator gets within
a distance of $k_{kill}$ units of the prey, the predator kills the prey and then eats it.
The predator is present somewhere in the world for the entire duration of the 
simulation. 

The prey has an internal energy level. To avoid starving it must maintain its 
energy level above zero. At each time interval the prey consumes an amount of 
energy determined by equation 1. 

$$
\Delta E = -(B + A v^2)
\tag{1}
$$

The base energy consumption (determined by B) forces the prey to occasionally 
go and collect food. The other term is dependent on the square of the prey’s 
velocity to penalise travelling at high speeds. The prey replenishes its energy 
level by eating food. The prey needs to move close to a piece of food before it 
can eat it. When a piece of food is eaten the food is removed and the prey gains 
k units of energy. If E drops below zero the prey starves to death. 
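The energy bookkeeping of equation 1 is easy to state in code. In the sketch below the default values for B, A and k are placeholders chosen for illustration; the constants actually used in the experiments are listed in the paper's Appendix A.

```python
def step_energy(energy, speed, ate_food, base=0.1, drag=0.05, food_gain=5.0):
    """Apply one time step of equation 1 and return (new_energy, starved).

    base      -- B, the energy consumed per step regardless of movement
    drag      -- A, the coefficient of the speed-squared term
    food_gain -- k, the energy gained from one piece of food
    (all three defaults are placeholder values)
    """
    energy -= base + drag * speed ** 2
    if ate_food:
        energy += food_gain
    return energy, energy < 0


half_speed_cost = 1.0 - step_energy(1.0, 0.5, False)[0]
full_speed_cost = 1.0 - step_energy(1.0, 1.0, False)[0]
```

Because of the quadratic term, a step at full speed costs more than a step at half speed, which is why the forage and hide behaviours described later travel at half speed.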

The simulation was written in Java. There are two interfaces: a graphical ap- 
plet interface and a command line interface. The applet interface can be accessed 
on the web at http://www.cs.mu.oz.au/~scv/botsim/ 



3 Architecture of the Prey 

The prey uses a layered architecture which we build up incrementally [20], [4]. 
The prey operates in distinct behavioural modes. It monitors the level of some 
simple stimuli, such as ‘hunger’, to determine which behavioural mode to
operate in. At the most basic stage the prey ignores the predator completely and is
solely focused on collecting enough food to avoid starving. At each stage, an- 
other behavioural mode or stimulus is added to the prey’s repertoire to assist 
it in avoiding the predator. New behaviours are able to subsume or suppress 
behaviours introduced at previous levels. 



3.1 The Hiding Bot 

First we consider a very simple bot. The hiding bot looks for food when it is 
hungry and hides in the shelter when it is not. It does not detect an approaching 
predator. The hiding bot’s survival strategy is basically to spend as much time 
in the shelter as possible, while avoiding starvation. 
The hiding bot uses one stimulus with which to make its decisions: the
internal energy level ($E$). It uses this information to choose between one of two
behavioural modes in which it can operate:

Forage. The prey searches for food and eats it. The prey follows the odour 
gradient emitted by the food until it reaches an item of food which it then 
eats. The prey travels at half speed to conserve energy. 

Hide. The prey moves to the shelter and hides there when it arrives. The prey 
travels at half speed to conserve energy. 

Figure 1a shows the hiding bot’s control architecture.



3.2 The Running Away Bot 

The running away bot actively keeps an eye out for the predator while it is 
outside of the shelter. The bot operates in the same way as the hiding bot in 
that it ventures out of the shelter for food when it is hungry, but if while the 
bot is out of the shelter it detects the predator it will scurry back to the shelter 
for safety. 

The running away bot may use the behavioural modes of the hiding bot: hide 
and forage, and in addition may use: run away mode. 

Run Away. The prey runs to the shelter at maximum velocity. 



To determine when to run away, the bot uses the stimulus predator fear. It is
dependent on the distance ($d_{pred}$) between the prey and the predator as shown
in equation 2.

$$
P = A_p \, e^{-d_{pred} / \ldots}
\tag{2}
$$

The algorithm used to control the running away bot is shown in figure 1b.



3.3 The Memory Bot 

The hiding bot and the running away bot are stateless. The memory bot explores 
what advantage a simple piece of state information can give an agent. [6] demon- 
strated that adding even a few memory bits can give a significant improvement 
to the performance of a reactive object tracking system. 

The memory bot remembers when it last saw the predator. This affects the
stimulus memory fear. When the predator is seen, memory fear instantly rises to
the maximum. It decays exponentially with time ($t_{pred}$) from when the predator
was last seen, as shown in equation 3.

$$
M = A_m \, e^{-t_{pred} / \ldots}
\tag{3}
$$

The memory bot operates in the same way as the running away bot but if its
memory fear is still above a threshold $T_m$ then it will continue to hide. Figure 1c
shows the control algorithm.
Fig. 1. The architecture of the prey. (a) Hiding bot. By default the bot hides; if the
bot is hungry, the forage behaviour takes control of the actuators. (b) The running
away bot. Run away takes control of the actuators if the predator is seen. (c) The
memory bot. If the predator was seen recently the forage behaviour is suppressed. (d)
The dodging bot. Dodge takes precedence over all other behaviours.



3.4 The Dodging Bot 



The dodging bot has a reflex action with which it attempts to evade the predator 
if it gets too close. 

If the predator fear stimulus crosses a threshold $T_D$ then the prey will go
into dodge mode:



Dodge. The prey immediately changes to a new direction which is orthogonal to 
the direction of the approaching predator. The prey very rapidly accelerates 
to maximum velocity. (It is in effect a jump to the side.) 
This manoeuvre is somewhat analogous to the escape reflex in the crayfish in
that (1) a simple test is used to determine when to activate the response, and
(2) to be effective a special piece of hardware is needed. The crayfish makes use
of special flexors in the abdomen which are only used in an escape tailflip; this
gives it extremely rapid acceleration. The dodging bot uses a higher amount of
acceleration in the dodge manoeuvre than it normally has available to it. The
dodge manoeuvre is also somewhat similar to the zig-zagging used by the evasive
agents in the [18] simulations.

The dodge behaviour takes precedence over all other behaviours. Figure 1d
shows the control layer diagram of the dodging bot.
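The layering of the four bots can be summarised as a priority ordering over behavioural modes. The sketch below is one possible reading of Figure 1, not the authors' implementation: the threshold `t_run` for the run away behaviour, the decay scales in the two fear functions and all default values are assumptions introduced for illustration.

```python
import math


def predator_fear(distance, amplitude=1.0, scale=10.0):
    """Fear that falls off exponentially with distance (scale is a placeholder)."""
    return amplitude * math.exp(-distance / scale)


def memory_fear(steps_since_seen, amplitude=1.0, scale=50.0):
    """Fear that decays exponentially since the last sighting (scale is a placeholder)."""
    return amplitude * math.exp(-steps_since_seen / scale)


def select_mode(energy, p_fear, m_fear, t_hunger, t_run, t_memory, t_dodge):
    """Pick the highest-priority behaviour whose trigger condition holds."""
    if p_fear > t_dodge:
        return "dodge"        # reflex: overrides everything else
    if p_fear > t_run:
        return "run_away"     # predator detected: head for the shelter
    if m_fear > t_memory:
        return "hide"         # predator seen recently: foraging is suppressed
    if energy < t_hunger:
        return "forage"       # hungry and no recent threat
    return "hide"             # default behaviour


thresholds = dict(t_hunger=6.0, t_run=0.2, t_memory=0.4, t_dodge=0.8)
mode = select_mode(energy=3.0, p_fear=predator_fear(30.0),
                   m_fear=memory_fear(200.0), **thresholds)
```

Dropping the first test gives the memory bot, dropping the first and third gives the running away bot, and keeping only the last two lines gives the hiding bot.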

4 Results 

The four bots were placed in the environment to see how well they survived. 
Different configurations were tried for each bot by varying the thresholds used 
to make the decisions. 

Each configuration was tested a thousand times. For each configuration the 
following statistics were recorded: 

1. The mean number of time steps that the bot survived. 

2. The median number of time steps that the bot survived. 

3. The standard deviation of the survival time. 

4. A tally of the causes of death, i.e., on how many runs the bot ran out of 
energy (starved), on how many runs the bot was caught by the predator 
(killed) and on how many runs the bot reached the end of the simulation 
still alive (survived). 
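The four statistics can be computed with the Python standard library. The sketch below assumes that each run is recorded as a pair of survival time and cause of death, and it uses the population standard deviation; the paper does not say which form of the standard deviation was used, and the sample data are invented.

```python
from collections import Counter
from statistics import mean, median, pstdev


def summarise(runs):
    """runs: list of (survival_time, cause) pairs, one per simulation run."""
    times = [time for time, _cause in runs]
    return {
        "mean": mean(times),
        "median": median(times),
        "stdev": pstdev(times),                      # population form (assumption)
        "causes": Counter(cause for _time, cause in runs),
    }


# Invented sample: three short runs and one run that lasts the whole simulation.
sample = [(100, "starved"), (100, "starved"), (100, "killed"), (1000, "survived")]
summary = summarise(sample)
```

In this sample a single long run raises the mean well above the median, which is why both statistics are worth recording.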

Refer to Appendix A for the values of constants used in the simulation. 



4.1 The Hiding Bot 

Since the hiding bot is unable to detect the predator its behaviour is governed 
only by the hunger threshold, which determines when it will hide in the shelter 
and when it will venture out to look for food. Figure 2a shows how the survival 
time and cause of death vary as the hunger threshold (Th) changes. If the hunger 
threshold is very low, then the bot will wait until it is almost completely out of 
energy before venturing out to look for food. There is a very high chance that 
the bot will starve to death before finding the food. If the hunger threshold is 
very high, then the bot will spend most of its time out of the shelter looking for 
food and there is a greater chance of it being eaten by the predator. The bots 
that do the best are the ones that stay in the shelter as long as possible while 
still having a good chance of finding food in time before starving. 

Fig. 2. Plots of the survival times and the causes of death versus the hunger threshold
for each bot: (a) the hiding bot, (b) the running bot, (d) the memory bot and (e) the
dodging bot. (c) The survival times and cause of death versus memory threshold for
the memory bot ($T_h \sim 6$).

Figure 2a shows how the survival rate of the hiding bot varies with
the hunger threshold. A curious feature of the graph in figure 2a is that as the
hunger threshold goes from 0 to about 2, the median declines slightly but the
mean rises. The explanation of this phenomenon is that at very low hunger thresholds,
when the bot ventures out it is more likely to starve than
to find food. While most of the bots die of this cause, the median stays
low. However, the lucky few that find food live significantly longer and so push up
the mean.

4.2 The Running Bot 

The running bot survives significantly better than the hiding bot as can be seen 
in figure 2b. 

After seeing the predator the running bot is generally effective at running 
back to the shelter in time before the predator overhauls it. Provided that the 
hunger threshold is set at a reasonable level the death of the running bot gen- 
erally occurs under one of three circumstances: (1) the bot is too far away from 
the shelter when the predator sees it and is therefore run down by the predator
in pursuit, (2) the predator approaches so that it is between the bot and the 
shelter; in this case the bot is cut off while running to safety, and (3) the bot is 
chased before it reaches any food so it is even shorter of energy when it gets back 
to the shelter; this causes the bot to die of exhaustion rather than being killed 
by the predator directly. Furthermore, the running bot suffers from having no 
memory and a shorter seeing distance than the predator. If the bot is chased to 
the shelter and is still hungry it will venture out as soon as it can no longer see 
the predator. But it is very likely that the bot will be still within the predator’s 
seeing distance and therefore is immediately chased again. 

4.3 The Memory Bot 

The simple state information provided by the memory fear stimulus gives the 
memory bot a big boost in survival chances. The memory bot is able to run back 
to the shelter and stay there long enough until the predator has passed. Since 
the predator moves randomly, the longer it stays in the shelter the greater the 
chance that the predator will have passed. Balanced against this is that staying 
longer in the shelter reduces the chance of having enough energy to reach new 
food before it starves. The bot will stay in the shelter until the memory fear (see
equation 3) drops below the threshold $T_m$. If the bot has a very high threshold
it will stay in the shelter only a short time. If the bot has a very low threshold it 
will stay in the shelter a very long time. Figure 2c shows how the survival rate 
is affected by the value chosen for the memory threshold. Note that the survival 
rate is fairly even for thresholds between 0.2 and 0.6, reflecting that the tradeoff 
between waiting out the predator and risk of starvation is fairly evenly balanced. 

Figure 2d shows how the survival rate is affected by the hunger threshold for 
the memory bot. The optimal hunger threshold is greater for the memory bot 
than for the running bot. This can be explained by two factors: (1) the memory 
bot has got a better chance of surviving an encounter with the predator so it can 
risk spending more time out of the shelter and (2) if the memory bot collects 
more energy then it can spend more time in the shelter after it has been chased 
by a predator, giving it a higher chance of the predator having left.

The memory bot is still killed if it is too far away from the shelter when
pursuit begins or if the predator is in between it and the shelter. It is fairly
successful in waiting in the shelter until the predator has passed. It sometimes
starves while waiting for the predator to pass and sometimes is unlucky and
finds the predator still waiting outside after it has waited.



4.4 The Dodging Bot 

The dodge manoeuvre of the dodging bot allows it to survive in one of the 
situations where the memory bot is frequently killed. If the predator attacks the 
prey coming from the direction of the shelter, the dodging bot is able to evasively 
side-step the predator. The predator coming at full speed is unable to adjust its 
direction in time to catch the prey. The dodge manoeuvre is less successful in 
evading the predator chasing from behind. However as with the running bot and 
the memory bot, in most cases when the prey is being chased from behind it 
will be able to reach the shelter in time. It is usually only when the prey is very 
far from the shelter that it is caught in pursuit. 

Figure 2e shows how the survival rate of the dodging bot varies with the 
hunger threshold. The dodging bot has a higher optimal hunger threshold again 
compared to the memory bot. 

5 Discussion and Conclusions 

A trend that emerges is that the prey spends more time outside the shelter foraging
for food as it gets better at avoiding the predator. This phenomenon may be 
caused by two factors: (1) for the evasion tactics to be advantageous the prey 
needs enough energy to do the evasive manoeuvres and still collect food, and (2) 
because the prey is more likely to survive an encounter with the predator it can 
afford to spend more time outside the shelter foraging for food, thereby reducing 
its risk of starvation. 

To optimise its survival time the prey needs to balance its primary task, 
collecting food, with taking evasive action. If the predator were nonexistent the 
optimal survival strategy for the prey would be to spend all its time collecting 
food to nullify any risk of starvation. In the presence of a predator, collecting 
food becomes a risky task. This causes the bots with poor predator avoidance to 
wait until they are really hungry before they will venture out to look for food. 
As the prey is equipped with more evasive capabilities the risk in collecting 
food diminishes. This increased confidence allows the prey to act more like it 
would if there were no predator. As more evasive capabilities are added, the 
prey’s behaviour pattern (when it is not taking evasive action) approaches what 
it would be if the predator did not exist. 

A more general implication of this result is that having separate subsys- 
tems to deal with dangerous situations allows an agent to be less obstructed 
in undertaking its primary activities. Robots presumably have a set of primary 
tasks which they have to perform. However, while undertaking their work, some
robots may periodically have to face obstructions or even dangers. A hypothetical
example is a rescue robot sent into a burning building that may have
to dodge falling debris. Having separate subsystems to deal with obstructions
may allow robots to be minimally affected in the way they achieve their primary
tasks. Animals have dedicated neural circuits to deal with unexpected hazardous
situations [3,13].

One of the biggest improvements given to the prey in this simulation is the 
addition of some simple memory information. In this memory model the time 
that the predator was last seen is used implicitly as a predictor of the predator’s 
present proximity. 

The fixed hierarchical model used for the agents seems suboptimal for this 
scenario. One kind of behaviour should not always have precedence over another. 
For example because hiding in a shelter after seeing a predator has a higher 
precedence than foraging, in some cases the bot would starve to death in the 
shelter. A better model would be more flexible: wait longer in the shelter if energy 
reserves are relatively high, and shorter if energy reserves are low. Different 
actions need different precedence at different times. 
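
The flexible rule suggested above can be sketched in a few lines. This is only an illustration under stated assumptions, not the model used in the simulation: the function name `shelter_wait_time`, the linear scaling, and the default bounds `min_wait` and `max_wait` are all hypothetical.

```python
def shelter_wait_time(energy, max_energy, min_wait=5.0, max_wait=60.0):
    """Return how long to stay in the shelter, scaled by energy reserves.

    Hypothetical rule: the wait grows linearly from min_wait (no energy)
    to max_wait (full energy). All names and values are illustrative.
    """
    if max_energy <= 0:
        raise ValueError("max_energy must be positive")
    # Clamp the energy fraction to [0, 1] so out-of-range readings are safe.
    fraction = min(max(energy / max_energy, 0.0), 1.0)
    return min_wait + fraction * (max_wait - min_wait)
```

With such a rule a well-fed bot waits out the predator for longer, while a hungry bot leaves early instead of starving in the shelter.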

There is biological evidence that in animals the precedence of actions is much 
more flexible. [21] review biological findings about parts of the vertebrate brain 
and argue that the basal ganglia acts as a central decision making point for 
arbitrating between conflicting actions. They argue that similar specialised
switching mechanisms might be employed in layered robot architectures (such
as [4]) to provide more flexible action selection. 

In the crayfish the giant command neurons responsible for triggering the 
escape response are modulated by other parts of the nervous system [22,11,9]. 
The trigger threshold adjusts according to various circumstances, such as during 
feeding and restraint [24] and also adjusts according to longer term conditions 
such as the mating cycle and social dominance [26].

Edwards [8] proposes a model for behavioural choice in crayfish that uses 
mutual inhibition amongst the neural command centres. In Edwards’ model 
there is one command neuron for seven different behavioural modes. Each neu- 
ron receives excitatory stimuli from sensors. Each neuron is able to inhibit other 
command neurons and also receives inhibitory signals from the excited command 
neurons. After summing the excitations and the inhibitions, the command neu- 
ron with the greatest excitation wins. Edwards’ model is able to give actions 
different precedence at different times. An attempt was made to write a bot 
based on Edwards’ architecture for the scenario described in this paper. Prelim- 
inary results indicate that in this scenario it performs slightly better, but the 
results are inconclusive. 
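
The selection step described above (sum excitation and inhibition, then let the most excited command neuron win) can be sketched as follows. This is a simplified, single-step illustration and not Edwards' model itself or the bot mentioned above: the behaviour names, the weights, and the assumption that inhibition is proportional to the source neuron's excitation are all hypothetical.

```python
def select_behaviour(excitation, inhibition_weights):
    """Pick the winning command neuron after one round of mutual inhibition.

    excitation: dict mapping behaviour name -> sensory excitation (>= 0).
    inhibition_weights: dict mapping (source, target) -> weight; an excited
        source neuron inhibits the target in proportion to its excitation.
    Returns the winner and the net activation of every neuron.
    """
    net = {}
    for target, drive in excitation.items():
        inhibition = sum(
            inhibition_weights.get((source, target), 0.0) * source_drive
            for source, source_drive in excitation.items()
            if source != target
        )
        net[target] = drive - inhibition
    winner = max(net, key=net.get)
    return winner, net


# Illustrative weights: hiding suppresses foraging more than the reverse.
weights = {("hide", "forage"): 0.8, ("forage", "hide"): 0.2}
print(select_behaviour({"forage": 0.6, "hide": 0.5, "escape": 0.2}, weights))
print(select_behaviour({"forage": 1.0, "hide": 0.5, "escape": 0.2}, weights))
```

Because the winner depends on the current excitation levels, the same pair of behaviours can take precedence in either order, which is the flexibility the fixed hierarchy lacks.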

The scenario examined in this paper is very specific. Previous pursuit-evasion 
experiments [5] have shown that effective evasion strategies are often very sen- 
sitive to the parameters of the environment. Future work may consider what 
kind of escape measures work best when faced with different kinds and variable 
sources of danger. The prey currently uses ‘magic’ perception to get the position 
of the predator. This is unrealistic. In future simulations the prey will have to 
infer the presence of a predator from noisy sensors. 

The work presented in this paper is a preliminary step in exploring the role
of integrating evasive actions in the context of doing other activities. As robots 
move out of the laboratory into more hostile environments, handling evasive 
actions may prove an important component of the robot architecture. 



A Constants Used in Simulation 



Constant    Value    Equations used
A           0.001    (1)
B           0.03     (1)
Ap          1        (2)
L           40       (2)
Am          1        (3)
T           15       (3)

Abstract. We present a uniform nonmonotonic solution for the problem
of reasoning about action on the basis of the argumentation-theoretic
approach in a series of papers. This paper is the first one in which we
solve the frame and the qualification problems in a simplifying setting 
without domain constraints or ramifications. Our theory is provably cor- 
rect relative to a sensible minimisation policy introduced on top of a 
temporal propositional logic. 



1 Motivation and Introduction 

The need for a good reasoning about action formalism is apparent for research 
in artificial intelligence (AI). Alongside the logicist view of artificial
intelligence, cognitivist and situated action-based approaches have emerged
more recently (see [10] and the references therein). The latter provide some
immediate and practical answers to certain issues of AI. The current problem 
domains for (Soccer) Robot Cup seem to be an area where these approaches 
promise to gain fruitful results. On the other hand, the logicist approach aims 
at long term solutions for the general problems of AI. From a logicist approach, 
formalising dynamic domains for reasoning about action can be realised within 
a logical knowledge representation. The general idea is that intelligent agents 
should be able to represent all kinds of knowledge in a uniform way such that 
some general problem solver can fully employ and find a solution based on their 
knowledge. As it turns out, there are difficulties with such a general approach to 
AI. Consider the task of formalising dynamic domains in some logical language. 
To formalise the dynamics of an action (or event) in a language with n fluents^, 
one needs to axiomatise not only the fluents that are affected by the
action but also those that are not. Essentially, it requires that n axioms be
asserted. Such a formalisation can hardly be considered a good representation. 
Hence, there is the need to solve this problem in logic-based reasoning about 
action formalisms. This is the well known frame problem as introduced by Mc- 
Carthy and Hayes ([15]). Moreover, there is still a problem in axiomatising the 
effects of an action, called the effect axiom. A logical axiomatisation requires that 

^ A fluent is a technical term referring to functions or predicates whose values can
vary over time.



M. Brooks, D. Corbett, and M. Stumptner (Eds.): AI 2001, LNAI 2256, pp. 519—531, 2001. 
(c) Springer- Verlag Berlin Heidelberg 2001 







Q.B. Vo and N.Y. Foo 



the conditions under which the effects will take place after executing the action 
be precisely specified. However, there are potentially infinitely many such
conditions, some of which the reasoner may never have thought about. No realistic
formalisation would ever be able to exhaustively enumerate all of those condi- 
tions. Nonetheless, to start a car, most people only worry about whether they 
have the key to that car. They never bother checking whether there is something 
blocking the tailpipe or checking all electric circuits to make sure that they are 
all well connected. Such a story has long been well-known within the community 
of common-sense reasoning, in particular reasoning about action. This is known 
as the qualification problem and was introduced by McCarthy (cf. [13]).

While there have been a number of solutions to the frame problem (see e.g. 
[19], [16] and [3]), the qualification problem has largely been ignored. Some peo- 
ple argue that the frame problem is already very challenging and it would be 
a good approach to thoroughly solve the frame problem before complicating a 
formalism with the qualification problem. We argue that there is a danger of ap- 
proaching these problems from that point of view for (at least) two reasons: (1) 
it may be very hard to come up with a uniform solution for all problems: while 
many existing solutions for the frame problem are monotonic (e.g. [3] and [16]), 
the qualification problem inherently requires a non-monotonic solution; and (2) 
these solutions of the frame problem can only succeed under some precise as- 
sumptions: 

- Actions always succeed. This is the action omniscience assumption. More pre- 
cisely, this assumption dictates that the qualification problem is skipped. 

- Fluents change if and only if the reasoner knows that there exists an action
that possibly changes their value. This can be termed the domain omniscience
assumption. It assumes that the reasoner has complete (ontological) knowledge
about the domain it is reasoning about.

The above two reasons are of course closely related as (1) arises due to the 
underlying assumptions in (2) which can no longer hold once the qualification 
problem is taken into consideration. 

In this paper, a uniform nonmonotonic solution for the two most basic problems
of reasoning about action is proposed. Basically, when performing common
sense reasoning, the reasoner relies on a number of plausible assumptions, e.g.
assuming that a given bird flies, or assuming that shooting a turkey with
a loaded gun causes it to die. The proposed representation formalism aims at
making these assumptions explicit so that an automated reasoner is conscious
(at least) of what assumptions it relies on when performing reasoning. This
is also the basic idea of assumption-based frameworks, which are at the heart of
Bondarenko et al.’s (1997) argumentation-theoretic approach. As a first step to- 
wards a comprehensive framework, we show how the frame and the qualification 
problems are solved in the absence of domain constraints and ramifications. 



2 Domain Descriptions 

We introduce a propositional action description language based on a more
comprehensive representation formalism proposed by Sandewall (1994). In particular,
we extend Drakengren and Bjäreland’s (1999) language so that it is possible
to describe narratives in our framework.

2.1 Syntax 

Following Sandewall, the underlying representation of time is a (discrete) time
structure $\mathbf{T} = \langle T, <, +, - \rangle$ consisting of

• a time domain $T$ whose members are called timepoints, which are integers in
this paper;

• $<$, $+$ and $-$, which are as usual for integers.

Given a time structure $\mathbf{T} = \langle T, <, +, - \rangle$, a signature with respect to $\mathbf{T}$ is
a tuple $\sigma = \langle \mathcal{T}, \mathcal{F}, \mathcal{A} \rangle$, where $\mathcal{T}$ is a set of timepoint variables, $\mathcal{F}$ is a set of
propositional fluent names, and $\mathcal{A}$ is a set of action names. We assume that all
sets in $\sigma$ are countable. We denote $\neg\mathcal{F} = \{\neg f \mid f \in \mathcal{F}\}$. A member of $\mathcal{F}^* = \mathcal{F} \cup \neg\mathcal{F}$
is a fluent literal. Moreover, $\mathcal{A} = \mathcal{A}_D \cup \mathcal{DA}$, where $\mathcal{A}_D$ is the set of domain-dependent
action names, called basic actions, e.g. load, shoot, etc., and $\mathcal{DA} = \{da_\varphi \mid \varphi \in \mathcal{F}^*\}$
is the set of dummy actions.

For each fluent literal $\varphi \in \mathcal{F}^*$, we introduce the following two propositions:
$AQ_\varphi$ and $FA_\varphi$. $AQ_\varphi$ is associated with the assumed qualifications upon the
preconditions of an action regarding the fluent literal $\varphi$. $FA_\varphi$ is associated with the
frame assumptions regarding $\varphi$. Given a set of fluent literals $F \subseteq \mathcal{F}^*$, we denote
$FA_F = \{FA_\varphi \mid \varphi \in F\}$ and $AQ_F = \{AQ_\varphi \mid \varphi \in F\}$.

A timepoint expression is one of the following: 

• a member of $T$,

• a timepoint variable in $\mathcal{T}$,

• an expression formed from timepoint expressions using $+$ and $-$. For convenience,
we will also write $\tau^+$ and $\tau^-$ instead of $\tau + 1$ and $\tau - 1$, respectively.

We denote the set of timepoint expressions by $\mathcal{TE}$.

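Definition 1 Let $\sigma = \langle \mathcal{T}, \mathcal{F}, \mathcal{A} \rangle$ be a signature and $\tau, \upsilon \in \mathcal{TE}$, $f \in \mathcal{F}$, $A \in \mathcal{A}$,
$R \in \{=, <\}$, $\oplus \in \{\wedge, \vee, \rightarrow, \leftrightarrow\}$. Define the basic (domain description) language
$\Lambda$ over $\sigma$ by:

$\Lambda ::= T \mid F \mid f \mid [\tau,\upsilon]A \mid \tau R \upsilon \mid \neg\Lambda \mid \Lambda_1 \oplus \Lambda_2 \mid [\tau]\Lambda$,

and the assumption base $AB$ by:

$AB = AB_{AQ} \cup AB_{FA}$, where $AB_{AQ} = \{[\tau,\upsilon]AQ_\varphi \mid \tau, \upsilon \in \mathcal{TE} \text{ and } \varphi \in \mathcal{F}^*\}$
and $AB_{FA} = \{[\tau]FA_\varphi \mid \tau \in \mathcal{TE} \text{ and } \varphi \in \mathcal{F}^*\}$.

The domain description language $\mathcal{L}$ (over $\sigma$) is defined as $\mathcal{L} = \Lambda \cup AB$.

$[\tau,\upsilon]A$ means the action $A$ is performed during the time interval $[\tau,\upsilon]$.
$[\tau,\upsilon]AQ_\varphi$ means the fluent literal $\varphi$ is assumed to be qualified to hold by the
end of the interval $[\tau,\upsilon]$. $[\tau]FA_\varphi$ means the fluent literal $\varphi$ is assumed by default
to persist from the timepoint $\tau$ to the next, i.e. the principle of inertia.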

A formula that does not contain any connectives (i.e. $\neg$, $\wedge$, $\vee$, $\rightarrow$, $\leftrightarrow$ and $[\cdot]$)
is atomic. If $\gamma$ is atomic and $\tau \in \mathcal{TE}$, then the formulas $\gamma$, $[\tau]\gamma$, $\neg\gamma$, $\neg[\tau]\gamma$,
and $[\tau]\neg\gamma$ are literals.

Let $\gamma$ be a formula. A fluent $f \in \mathcal{F}$ occurs free in $\gamma$ iff it does not occur within
the scope of a $[\tau]$ expression in $\gamma$. $\tau \in \mathcal{TE}$ binds $f$ in $\gamma$ if a formula $[\tau]\psi$ occurs as
a subformula of $\gamma$, and $f$ is free in $\psi$. If no fluent occurs free in $\gamma$, $\gamma$ is closed. If
$\gamma$ does not contain any occurrence of $[\tau]$ for any $\tau \in \mathcal{TE}$, then $\gamma$ is propositional.

2.2 Semantics 

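Definition 2 Let $\sigma = \langle \mathcal{T}, \mathcal{F}, \mathcal{A} \rangle$ be a signature. A state over $\sigma$ is a function
from $\mathcal{F}$ to the set $\{T, F\}$ of truth values. A history over $\sigma$ is a function $h$ from
$T$ to the set of states. A valuation is a function $\phi$ from $\mathcal{TE}$ to $T$. A narrative
assignment is a function $\eta$ from $T \times \mathcal{A} \times T$ to the set $\{T, F\}$. In addition, we define
$\varepsilon_q : T \times AQ_{\mathcal{F}^*} \times T \rightarrow \{T, F\}$ and $\varepsilon_f : T \times FA_{\mathcal{F}^*} \rightarrow \{T, F\}$. An interpretation over
$\sigma$ is a tuple $\langle h, \phi, \eta, \varepsilon_q, \varepsilon_f \rangle$ where $h$ is a history, $\phi$ is a valuation, $\eta$ is a narrative
assignment and $\varepsilon_q, \varepsilon_f$ are defined as above.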

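Definition 3 Let $\gamma, \delta \in \Lambda$ and $I = \langle h, \phi, \eta, \varepsilon_q, \varepsilon_f \rangle$ an interpretation. Assume
$\tau, \upsilon \in \mathcal{TE}$, $f \in \mathcal{F}$, $A \in \mathcal{A}$, $R \in \{=, <\}$, $\varphi \in \mathcal{F}^*$, $\oplus \in \{\wedge, \vee, \rightarrow, \leftrightarrow\}$, and
$X \in \{T, F\}$. Define the truth value of $\gamma$ in $I$ for a timepoint $t \in T$, denoted $I(\gamma, t)$,
as follows:

$I(X, t) = X$

$I(f, t) = h(t)(f)$

$I([\tau,\upsilon]A, t) = \eta(\tau, A, \upsilon)$

$I([\tau,\upsilon]AQ_\varphi, t) = \varepsilon_q(\tau, AQ_\varphi, \upsilon)$

$I([\tau]FA_\varphi, t) = \varepsilon_f(\tau, FA_\varphi)$

$I(\tau R \upsilon, t) = \phi(\tau) \, R \, \phi(\upsilon)$

$I(\gamma \oplus \delta, t) = I(\gamma, t) \oplus I(\delta, t)$

$I([\tau]\gamma, t) = I(\gamma, \phi(\tau))$

Two formulas $\gamma$ and $\delta$ are equivalent iff $I(\gamma, t) = I(\delta, t)$ for all $I$ and $t$. An
interpretation $I$ is a model of a set $\Gamma \subseteq \Lambda$ of formulas, denoted $I \models \Gamma$, iff
$I(\gamma, t) = T$ for every $t \in T$ and $\gamma \in \Gamma$. A formula $\gamma \in \Lambda$ is entailed by a set
$\Gamma \subseteq \Lambda$ of formulas, denoted $\Gamma \models \gamma$, iff $\gamma$ is true in all models of $\Gamma$.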
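
The truth definition above can be read as a recursive evaluator. The sketch below is a minimal illustration, assuming a tuple encoding of formulas that is not part of the paper: the tags `"const"`, `"fluent"`, `"occ"`, `"rel"`, `"not"`, `"at"` and the connective names are hypothetical, timepoint expressions are restricted to integers and variable names, negation is taken to be classical, and the assumption propositions $AQ_\varphi$ and $FA_\varphi$ are left out for brevity.

```python
import operator

CONNECTIVES = {
    "and": lambda a, b: a and b,
    "or": lambda a, b: a or b,
    "implies": lambda a, b: (not a) or b,
    "iff": lambda a, b: a == b,
}
RELATIONS = {"=": operator.eq, "<": operator.lt}


def time_value(expr, valuation):
    """Evaluate a timepoint expression: an integer or a variable name."""
    return expr if isinstance(expr, int) else valuation[expr]


def truth(formula, t, history, valuation, occurrences):
    """Truth value of `formula` at timepoint t.

    history: dict timepoint -> {fluent name: bool}
    valuation: dict timepoint variable -> timepoint
    occurrences: set of (start, action, end) triples that occur
    """
    kind = formula[0]
    if kind == "const":
        return formula[1]
    if kind == "fluent":
        return history[t][formula[1]]
    if kind == "occ":  # [tau, upsilon]A
        _, tau, action, upsilon = formula
        return (time_value(tau, valuation), action,
                time_value(upsilon, valuation)) in occurrences
    if kind == "rel":  # tau R upsilon
        _, rel, tau, upsilon = formula
        return RELATIONS[rel](time_value(tau, valuation),
                              time_value(upsilon, valuation))
    if kind == "not":
        return not truth(formula[1], t, history, valuation, occurrences)
    if kind == "at":  # [tau]gamma: evaluate gamma at the value of tau
        _, tau, sub = formula
        return truth(sub, time_value(tau, valuation),
                     history, valuation, occurrences)
    left = truth(formula[1], t, history, valuation, occurrences)
    right = truth(formula[2], t, history, valuation, occurrences)
    return CONNECTIVES[kind](left, right)


history = {0: {"loaded": False}, 1: {"loaded": True}}
occurrences = {(0, "load", 1)}
valuation = {"s": 0}
effect = ("implies", ("occ", "s", "load", 1), ("at", 1, ("fluent", "loaded")))
print(truth(effect, 0, history, valuation, occurrences))
```

The `"at"` case is the only one that changes the evaluation timepoint, which mirrors the role of the $[\tau]$ operator in the definition.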

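Definition 4 Let $I = \langle h, \phi, \eta, \varepsilon_q, \varepsilon_f \rangle$ be an interpretation. The set $Occ^I =
\{(t, A, u) \in T \times \mathcal{A} \times T \mid \eta(t, A, u) = T\}$ is called the action occurrence denotation
of $I$. The set $FA^I = \{(t, FA_\varphi) \in T \times FA_{\mathcal{F}^*} \mid \varepsilon_f(t, FA_\varphi) = T\}$ is called the FA-denotation
of $I$. The set $AQ^I = \{(t, AQ_\varphi, u) \in T \times AQ_{\mathcal{F}^*} \times T \mid \varepsilon_q(t, AQ_\varphi, u) = T\}$
is called the AQ-denotation of $I$. ■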

2.3 Background 

Bondarenko et al. (1997) propose a unified framework for default reasoning called
the argumentation-theoretic approach, which we will use as the underlying inference
mechanism for our system. We reproduce the relevant definitions from
Bondarenko et al.'s work for completeness.

A deductive system is a pair $\langle \mathcal{L}, \mathcal{R} \rangle$, where

- $\mathcal{L}$ is a formal language consisting of countably many sentences, and
- $\mathcal{R}$ is a set of inference rules of the form

$$\frac{\alpha_1, \ldots, \alpha_n}{\alpha}$$

where $\alpha, \alpha_1, \ldots, \alpha_n \in \mathcal{L}$ and $n \geq 0$.

Any set of sentences $T \subseteq \mathcal{L}$ is called a theory. A deduction from a theory $T$
is a sequence $\beta_1, \ldots, \beta_m$, where $m > 0$, such that, for all $i = 1, \ldots, m$,

- $\beta_i \in T$, or

- there exists $\frac{\alpha_1, \ldots, \alpha_n}{\beta_i}$ in $\mathcal{R}$ such that $\alpha_1, \ldots, \alpha_n \in \{\beta_1, \ldots, \beta_{i-1}\}$.

$T \vdash_{(\mathcal{L},\mathcal{R})} \alpha$ means that there is a deduction from $T$ whose last element is
$\alpha$. $Th_{(\mathcal{L},\mathcal{R})}(T)$ is the set $\{\alpha \in \mathcal{L} \mid T \vdash_{(\mathcal{L},\mathcal{R})} \alpha\}$. Since the language $\mathcal{L}$ is generally
kept fixed whereas the set of inference rules $\mathcal{R}$ is likely to vary depending on the
description of the domain, when there is no possible confusion we will abbreviate
$\vdash_{(\mathcal{L},\mathcal{R})}$ and $Th_{(\mathcal{L},\mathcal{R})}$ as $\vdash_{\mathcal{R}}$ and $Th_{\mathcal{R}}$, respectively. Thus the classical inference
relation $\vdash$ can also be written as $\vdash_{\mathcal{R}_C}$ where $\mathcal{R}_C$ is the set of inference rules of
classical propositional logic. Note also that every set of inference rules considered
in this paper will be a superset of $\mathcal{R}_C$.

Given $r = \frac{\alpha_1, \ldots, \alpha_n}{\alpha} \in \mathcal{R}$, we will also denote $prem(r) = \{\alpha_1, \ldots, \alpha_n\}$, the
premises of $r$, and $cons(r) = \alpha$, the consequence of $r$.
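
For a finite set of ground rules, the set of sentences deducible from a theory can be computed by forward chaining. The sketch below is an illustration under simplifying assumptions: sentences are opaque strings, each rule is a `(premises, consequence)` pair, and the inference rules of classical propositional logic are not included. The function name `theory_closure` is hypothetical.

```python
def theory_closure(theory, rules):
    """Return every sentence deducible from `theory` by forward chaining.

    rules: iterable of (premises, consequence) pairs, where premises is a
    collection of sentences and consequence is a single sentence. A rule
    with no premises (n = 0) always fires.
    """
    derived = set(theory)
    changed = True
    while changed:
        changed = False
        for premises, consequence in rules:
            if consequence not in derived and set(premises) <= derived:
                derived.add(consequence)
                changed = True
    return derived


rules = [(("p",), "q"), (("q", "r"), "s"), ((), "r")]
print(sorted(theory_closure({"p"}, rules)))
```

Each pass adds at least one new sentence or stops, so the loop terminates whenever the rule set is finite.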

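Definition 5 [2] Given a deductive system $\langle \mathcal{L}, \mathcal{R} \rangle$, an assumption-based framework
with respect to $\langle \mathcal{L}, \mathcal{R} \rangle$ is a tuple $\langle T, Ab, \overline{\,\cdot\,} \rangle$ where

- $T, Ab \subseteq \mathcal{L}$ and $Ab \neq \emptyset$,

- $\overline{\,\cdot\,}$ is a mapping from $Ab$ into $\mathcal{L}$, where $\overline{\alpha}$ denotes the contrary of $\alpha$.

Definition 6 [2] Given an assumption-based framework $\langle T, Ab, \overline{\,\cdot\,} \rangle$,

- a set of assumptions $\Delta \subseteq Ab$ attacks an assumption $\alpha \in Ab$ iff $T \cup \Delta \vdash_{\mathcal{R}} \overline{\alpha}$,

- a set of assumptions $\Delta \subseteq Ab$ attacks a set of assumptions $\Delta' \subseteq Ab$ iff
$\Delta$ attacks some assumption $\alpha \in \Delta'$.

As assumptions are expressed in terms of usual propositions, we will replace
the notion of contrariness $\overline{\,\cdot\,}$ in Bondarenko et al.’s system with classical
negation $\neg$ and omit it from the specification of assumption-based frameworks.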

3 Reasoning about Action with Argumentation-Theoretic 
Approach 

In the rest of this paper, we introduce a uniform framework for solving the 
frame and qualification problems using the frame and qualification assumptions. 
General solutions for the frame and the qualification problems can be obtained 
by computing plausible sets of assumptions which guarantee that extensions 
computed from plausible sets of assumptions will be consistent when the given 
theory is consistent. 

Definition 7 A deductive system $\langle \mathcal{L}, \mathcal{R} \rangle$ is well-defined iff for each subset $S \subseteq
\mathcal{R}$, if the set $\bigcup_{r \in S} prem(r)$ is consistent then the set $CONS(S) = \{cons(r) \mid
r \in S\}$ is also consistent.

We will assume that a deductive system is well-defined. Being formalised in 
terms of the argumentation-theoretic approach, the representation requires an 
extended notion of consistency. 

Definition 8 Let $\langle \mathcal{L}, \mathcal{R} \rangle$ be a deductive system. (a) A set of sentences $\Gamma \subseteq \mathcal{L}$
is $\mathcal{R}$-consistent iff $\Gamma \nvdash_{\mathcal{R}} F$; (b) an assumption-based framework $\langle T, Ab, \neg \rangle$ with
respect to $\langle \mathcal{L}, \mathcal{R} \rangle$ is consistent iff $T$ is $\mathcal{R}$-consistent. ■

Definition 9 Given an assumption-based framework $\langle T, Ab, \neg \rangle$, a set of assumptions
$\Delta \subseteq Ab$ rejects an assumption $\alpha \in Ab$ iff (a) $\Delta$ does not attack
itself, and (b) $\Delta \cup \{\alpha\}$ attacks itself.

Observation 1 Given an assumption-based framework $\langle T, Ab, \neg \rangle$ and a set of
assumptions $\Delta \subseteq Ab$, if $\Delta$ attacks an assumption $\alpha$ then $\Delta$ rejects $\alpha$.

We are interested in the assumptions which are rejected by a given set of
assumptions without being attacked by that set.

Definition 10 Given an assumption-based framework $\langle T, Ab, \neg \rangle$, a set of assumptions
$\Delta \subseteq Ab$ leniently rejects an assumption $\alpha \in Ab$ iff (a) $\Delta$ rejects $\alpha$,
and (b) $\Delta$ does not attack $\alpha$.

We denote $Lr(\Delta) \stackrel{\mathrm{def}}{=} \{\alpha \in Ab \mid \alpha \text{ is leniently rejected by } \Delta\}$.
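
The notions of attack, rejection and lenient rejection can be checked mechanically on small ground examples. The sketch below is only an illustration under stated assumptions: sentences are strings, the contrary of a sentence is written with a `~` prefix, only the explicitly listed rules are used (the rules of classical propositional logic are left out, so an inconsistent set does not derive everything), and a rule with a conjunctive conclusion is split into two rules. The toy shooting scenario and all identifiers are hypothetical encodings.

```python
def closure(sentences, rules):
    """Forward-chain (premises, consequence) rules from `sentences`."""
    derived = set(sentences)
    changed = True
    while changed:
        changed = False
        for premises, consequence in rules:
            if consequence not in derived and set(premises) <= derived:
                derived.add(consequence)
                changed = True
    return derived


def contrary(sentence):
    """Classical negation on string literals, written with a '~' prefix."""
    return sentence[1:] if sentence.startswith("~") else "~" + sentence


def attacks(theory, rules, delta, targets):
    """True iff theory + delta derives the contrary of some target."""
    derived = closure(set(theory) | set(delta), rules)
    return any(contrary(a) in derived for a in targets)


def rejects(theory, rules, delta, assumption):
    """delta does not attack itself, but delta + {assumption} does."""
    delta = set(delta)
    extended = delta | {assumption}
    return (not attacks(theory, rules, delta, delta)
            and attacks(theory, rules, extended, extended))


def leniently_rejects(theory, rules, delta, assumption):
    """delta rejects the assumption without attacking it directly."""
    return (rejects(theory, rules, delta, assumption)
            and not attacks(theory, rules, delta, {assumption}))


def lenient_rejections(theory, rules, assumption_base, delta):
    """Lr(delta): the assumptions leniently rejected by delta."""
    return {a for a in assumption_base
            if leniently_rejects(theory, rules, delta, a)}


theory = {"loaded@0", "alive@0", "shoot@0-1"}
shoot = ("loaded@0", "shoot@0-1", "AQ_dead@0-1")
rules = [
    (("alive@0", "FA_alive@0"), "alive@1"),  # inertia of alive
    (shoot, "~alive@1"),                     # effect of shooting
    (shoot, "~FA_alive@0"),                  # shooting defeats inertia
]
base = {"FA_alive@0", "AQ_dead@0-1"}
print(lenient_rejections(theory, rules, base, {"FA_alive@0"}))
print(lenient_rejections(theory, rules, base, {"AQ_dead@0-1"}))
```

In this toy encoding the frame assumption rejects the qualification assumption only leniently, whereas the qualification assumption attacks the frame assumption outright, so it rejects it but not leniently.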

The frame assumptions are the essence of the inertia problem, and their role in 
the argumentation approach is illustrated below by the Yale Shooting Problem 
(YSP) [9]. In this formalisation we intentionally ignore the qualification problem 
(it is addressed in the next section) to highlight how the frame problem is solved. 
We consider a well-worn example to motivate our approach to the frame problem. 
Example 1

$AD_{YSP} = \{\, [\tau,\upsilon]load \rightarrow ([\upsilon]loaded \wedge \neg[\tau]FA_{\neg loaded}),$

$\qquad ([\tau,\upsilon]shoot \wedge [\tau]loaded) \rightarrow (\neg[\upsilon]alive \wedge \neg[\tau]FA_{alive}) \,\}$

The following rules representing the frame assumptions are added:

$FR_{YSP} = \left\{ \frac{[\tau]loaded,\ [\tau]FA_{loaded}}{[\tau^+]loaded},\ \frac{[\tau]alive,\ [\tau]FA_{alive}}{[\tau^+]alive},\ \frac{\neg[\tau]loaded,\ [\tau]FA_{\neg loaded}}{\neg[\tau^+]loaded},\ \frac{\neg[\tau]alive,\ [\tau]FA_{\neg alive}}{\neg[\tau^+]alive} \right\}$

Given a theory $T_{YSP} = \{[0]alive, [0,1]load, [1,2]wait, [2,3]shoot\}$, the argumentation-theoretic
approach will yield the following preferred set of assumptions
(cf. [2]): $\Delta_{YSP} = \{[\tau]FA_{loaded} \mid \tau \in \mathit{Time}\} \cup \{[\tau]FA_{\neg loaded} \mid \tau \in \mathit{Time}\} \cup
\{[\tau]FA_{alive} \mid \tau \in \mathit{Time}\} \cup \{[\tau]FA_{\neg alive} \mid \tau \in \mathit{Time}\} \setminus \{[0]FA_{\neg loaded}, [2]FA_{alive}\}$.
They give rise to the following preferred extension (cf. [2]): $Th(T_{YSP} \cup \{[\tau]loaded
\mid \tau \geq 1\} \cup \{[1]alive, [2]alive\} \cup \{\neg[\tau]alive \mid \tau \geq 3\} \cup \Delta_{YSP})$. This extension is also the
stable extension and well-founded semantics (cf. [2]) of the given theory under
the argumentation-theoretic approach. Note that if one would like to be
uncertain about whether the gun is still loaded after the shooting action, one
simply needs to add an axiom, $[\tau,\upsilon]shoot \rightarrow \neg[\tau]FA_{loaded}$, to dictate that
the persistence of the fluent loaded after the action shoot is not guaranteed.
In that case, we can still derive $[\tau]loaded$ for $\tau = 1, 2$, but we can no longer
give a definite assertion about $[\tau]loaded$ for $\tau \geq 3$. □
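
The derivation in Example 1 can be replayed by grounding the scenario over a finite horizon and forward chaining from the theory together with the chosen frame assumptions. The sketch below is an illustration only: the horizon of timepoints 0..3, the tuple encoding of sentences, the treatment of the action axioms as inference rules (with each conjunctive conclusion split in two), and all function names are assumptions made here, and no classical inference rules are included.

```python
from itertools import product

HORIZON = 3  # ground the scenario over timepoints 0..3 (an assumption)
FLUENTS = ("loaded", "alive")


def holds(t, fluent, value=True):
    return ("holds", t, fluent, value)


def frame(t, fluent, value=True):
    return ("FA", t, fluent, value)


def not_frame(t, fluent, value=True):
    return ("notFA", t, fluent, value)


def occurs(start, action, end):
    return ("occ", start, action, end)


def build_rules():
    rules = []
    # Frame rules: a literal persists one step if its frame assumption holds.
    for t, fluent, value in product(range(HORIZON), FLUENTS, (True, False)):
        rules.append(({holds(t, fluent, value), frame(t, fluent, value)},
                      holds(t + 1, fluent, value)))
    # Action descriptions, grounded for every interval [t, t + 1].
    for t in range(HORIZON):
        load, shoot = occurs(t, "load", t + 1), occurs(t, "shoot", t + 1)
        rules.append(({load}, holds(t + 1, "loaded")))
        rules.append(({load}, not_frame(t, "loaded", False)))
        rules.append(({shoot, holds(t, "loaded")}, holds(t + 1, "alive", False)))
        rules.append(({shoot, holds(t, "loaded")}, not_frame(t, "alive")))
    return rules


def closure(sentences, rules):
    """Forward-chain (premises, consequence) rules from `sentences`."""
    derived = set(sentences)
    changed = True
    while changed:
        changed = False
        for premises, consequence in rules:
            if consequence not in derived and premises <= derived:
                derived.add(consequence)
                changed = True
    return derived


theory = {holds(0, "alive"), occurs(0, "load", 1),
          occurs(1, "wait", 2), occurs(2, "shoot", 3)}
# All frame assumptions except the two excluded in the example.
assumptions = {frame(t, f, v)
               for t, f, v in product(range(HORIZON), FLUENTS, (True, False))}
assumptions -= {frame(0, "loaded", False), frame(2, "alive")}

extension = closure(theory | assumptions, build_rules())
for t in range(HORIZON + 1):
    true_now = [f for f in FLUENTS if holds(t, f) in extension]
    false_now = [f for f in FLUENTS if holds(t, f, False) in extension]
    print(t, "true:", true_now, "false:", false_now)
```

No derived sentence contradicts a retained frame assumption, which is the property that makes this assumption set a candidate in the sense of the example.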

As the above formalisation of YSP resembles that using default logic, it may 
be surprising that the problem of unintended models pointed out by Hanks 
and McDermott for circumscription, default logic and autoepistemic logic does not
happen here. The principal reason is the interaction of the inference rules and 




Solving the Qualification Problem 525 



the notion of attack in the argumentation-theoretic framework, which invalidates 
undesired assumptions. Notice that even if $\neg[2]loaded$ can be (magically) derived,
it cannot lead to $\neg[1]FA_{loaded}$. Therefore, the set of assumptions corresponding
to this case does not satisfy the conditions of preferred set of assumptions, thus 
ruling out this unintended model. 

We adopt the following guidelines in seeking a sensible solution for the
problems of reasoning about action: 

• The derived pieces of information don’t conflict with the given facts; 

• Occurrences of events are minimised; and 

• The inertia of fluents is maximised though the minimality of the event occur- 
rences will be of higher priority. 

However, while the preferred model semantics copes successfully with the 
YSP, it cannot properly account for the explanation problem, e.g. the Stanford
Murder Mystery or the Stolen Car Problem. The subtlety lies in the derivation of
the contrary of the frame assumption $FA_\varphi$. The contrary of a frame assumption
is derived only when both the occurrence of the event that brings about the 
change (absent in the Stolen Car Problem) and the preconditions required to be 
satisfied for the change to actually take place (absent in the Stanford Murder 
Mystery) are explicitly derivable. This is where the notion of (leniently) rejected 
assumptions is called into service. 

Definition 11 Given an assumption-based framework $\langle T, Ab, \neg \rangle$, a set of assumptions
$\Delta \subseteq Ab$ is presumable iff (a) $\Delta = \{\alpha \in Ab \mid T \cup \Delta \vdash_{\mathcal{R}} \alpha\}$ (in
Bondarenko et al.’s terms, $\Delta$ is closed), (b) $\Delta$ does not attack itself, and (c) for
each assumption $\alpha \notin \Delta$, $\alpha$ is rejected by $\Delta$. ■

Definition 12 Given an assumption-based framework $\langle T, Ab, \neg \rangle$, a set of assumptions
$\Delta \subseteq Ab$ is plausible iff (a) $\Delta$ is presumable, and (b) there exists no
$\Delta' \subseteq Ab$ such that $\Delta'$ is presumable and $Lr(\Delta') \subset Lr(\Delta)$. ■

4 Technical Framework 

Aside from the trivial case of occurrences of events causing the frame assump- 
tions to be rejected, two aspects of events can be distinguished: 

(a) An event happens but the change it is supposed to cause does not take 
place. We call this expectation failure and this is more or less the qualification 
problem; and 

(b) No events that are known to cause a change happen but the change does 
take place. We call this surprise and this is usually known as the explanation 
problem. 

The following assumption represents our underlying intuition behind reason- 
ing about action formalisms. 

Assumption 1 Intuitive models contain minimal (with respect to set inclusion) 
sets of surprises. □ 

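Definition 13 Let $\sigma = \langle \mathcal{T}, \mathcal{F}, \mathcal{A} \rangle$ be a signature. Assume $\tau, \upsilon \in \mathcal{TE}$, $A \in \mathcal{A}$,
$R \in \{=, <\}$, $\Phi \in \Lambda$, and $\varphi \in \mathcal{F}^*$. A domain description $D$ is defined to be a
tuple $\langle \mathcal{L}, \mathcal{R}, AB, \Gamma \rangle$, where:

1. $\mathcal{L}$ is the domain description language and $AB$ an assumption base over $\sigma$;

2. $\mathcal{R} = \mathcal{R}_C \cup \mathcal{R}_F \cup \mathcal{R}_A \cup \mathcal{R}_Q$, where

(a) $\mathcal{R}_C$ is the set of inference rules of (classical) propositional logic;

(b) $\mathcal{R}_F$ is the set of frame-based inference rules of the form $\frac{[\tau]\varphi,\ [\tau]FA_\varphi}{[\tau^+]\varphi}$, i.e.
those that represent the frame axioms in terms of inference rules;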

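(c) $\mathcal{R}_A$ is the set of action descriptions, which are inference rules of the form
$\frac{\Phi,\ [\tau,\upsilon]A,\ [\tau,\upsilon]AQ_\varphi}{[\upsilon]\varphi \,\wedge\, \neg[\tau]FA_{\neg\varphi}}$, i.e. those that represent the conditions for an action to bring
about some effect on a fluent; and

(d) $\mathcal{R}_Q$ is the set of qualification-based inference rules of the form $\frac{\Phi}{\neg[\tau,\upsilon]AQ_\varphi}$,
i.e. those that represent the (dis-)qualifications regarding the fluent literal
$\varphi$.

3. The theory $\Gamma \subseteq \Lambda$. ■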

Given a set of assumptions $\Delta$, we denote $\Delta_{FA} = \Delta \cap AB_{FA}$ and $\Delta_{AQ} =
\Delta \cap AB_{AQ}$.

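Definition 14 Let $\sigma = \langle \mathcal{T}, \mathcal{F}, \mathcal{A} \rangle$ be a signature and $D = \langle \mathcal{L}, \mathcal{R}, AB, \Gamma \rangle$ a
domain description over $\sigma$. An interpretation $I = \langle h, \phi, \eta, \varepsilon_q, \varepsilon_f \rangle$ is a model
of $D$ iff

1. $I$ is a model of $\Gamma$;

2. for each $r \in \mathcal{R}$, if $I \models prem(r)$ then $I \models cons(r)$. ■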

The following definition captures one of several aspects of the (model- 
theoretic) solution of the frame problem. This aspect is known as the action- 
oriented frame problem in Lin and Shoham’s (1995) terms. The proposed min- 
imisation policy formalises the intuition that change does not happen by itself 
but is caused by some kind of event. Thus, for each fluent, if its value is changed 
between two timepoints r and v, (at least) an occurrence of some event must 
end at v that brings about that change. 

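Definition 15 Let $D = \langle \mathcal{L}, \mathcal{R}, AB, \Gamma \rangle$ be a domain description and $I$ a model
of $D$. $I$ is a coherent model of $D$ iff

1. for each basic action $a \in \mathcal{A}_D$ and $\tau, \upsilon \in \mathcal{TE}$, if $\Gamma \not\models [\tau,\upsilon]a$ then $I \not\models [\tau,\upsilon]a$;
and

2. for each $\varphi \in \mathcal{F}^*$ and $\tau \in \mathcal{TE}$, if $I \models [\tau]\varphi \wedge \neg[\tau^+]\varphi$ then either (i) there are
$A \in \mathcal{A}_D$ and $\upsilon_1, \upsilon_2 \in \mathcal{TE}$ such that $\upsilon_2 = \tau^+$, and an action description $r \in \mathcal{R}$
for $A$ over $[\upsilon_1, \upsilon_2]$ that brings about this change, with $I \models prem(r)$, or (ii) $I \models [\tau, \tau^+]da_{\neg\varphi}$. ■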

Thus, in a coherent model (1) all satisfiable basic actions must follow from
the given theory, and (2) all changes are attributable to events.

Given an interpretation $I$, we want to extract the sets of assumptions satisfiable
in $I$.

Definition 16 Let $\sigma = \langle \mathcal{T}, \mathcal{F}, \mathcal{A} \rangle$ be a signature and $I = \langle h, \phi, \eta, \varepsilon_q, \varepsilon_f \rangle$ an
interpretation over $\sigma$. The set of frame assumptions satisfiable in $I$, denoted
$\Delta^I_{FA}$, is defined as follows: $\Delta^I_{FA} = \{[\tau]FA_\varphi \mid (\tau, FA_\varphi) \in FA^I\}$, and the set of
qualification assumptions satisfiable in $I$, denoted $\Delta^I_{AQ}$, is: $\Delta^I_{AQ} = \{[\tau,\upsilon]AQ_\varphi
\mid (\tau, AQ_\varphi, \upsilon) \in AQ^I\}$. We also write $\Delta^I_{QF} = \Delta^I_{AQ} \cup \Delta^I_{FA}$. ■
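
Conversely, given a theory $\Gamma$ and a set of assumptions $\Delta$, a reasoner can also
construct its models of the domain of interest.

Definition 17 Let $D = \langle \mathcal{L}, \mathcal{R}, AB, \Gamma \rangle$ be a domain description and $\Delta \subseteq AB$.
A model $I$ of $D$ is $\Delta$-relativised iff

1. for each $\alpha \in AB$, $I \models \alpha$ iff $\alpha \in \Delta$; and

2. $Occ^I = OA_D \cup DAS(\Delta)$, where: (i) $OA_D = \{(t, A, u) \in T \times \mathcal{A}_D \times T
\mid \Gamma \models [t,u]A\}$ and (ii) $DAS(\Delta) = \{(t, da_\varphi, t^+) \in T \times \mathcal{DA} \times T \mid
[t]FA_{\neg\varphi} \notin \Delta$ and there exists no action $A \in \mathcal{A}_D$ such that the following hold:
there is an action description in $\mathcal{R}$ with premises $\Phi$, $[\tau,\upsilon]A$ and $[\tau,\upsilon]AQ_\varphi$, and $\upsilon = t^+$, and $\Gamma \cup \Delta \models \{\Phi, [\tau,\upsilon]A, [\tau,\upsilon]AQ_\varphi\}\}$. ■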

4.1 The Frame Problem 

Initially we address the frame problem in a simple setting viz. without qualifi- 
cations, but will lift the restrictions later. 

Definition 18 Let $D = \langle \mathcal{L}, \mathcal{R}, AB, \Gamma \rangle$ be a domain description. $D$ is a simple
domain description, or S-domain, iff $\mathcal{R}_Q = \emptyset$ and $AQ$ does not occur anywhere
in $\mathcal{R}$ or $\Gamma$. ■

Definition 19 Let $D = \langle \mathcal{L}, \mathcal{R}, AB, \Gamma \rangle$ be a domain description. An interpretation
$I = \langle h, \phi, \eta, \varepsilon_q, \varepsilon_f \rangle$ is a simple model, or S-model, of $D$ iff

1. $I$ is a model of $D$; and

2. $\varepsilon_q(t, AQ_\varphi, u) = T$ for every $(t, AQ_\varphi, u) \in T \times AQ_{\mathcal{F}^*} \times T$. ■

This effectively isolates the frame problem from the qualification problem.
Note also that if $I$ is an S-model then $\Delta^I_{AQ} = AB_{AQ}$. A coherent S-model is an
S-model which is coherent.

Example 1 (continued.) Let $D_{YSP} = \langle \mathcal{L}, FR_{YSP}, AB, \Gamma \rangle$ be an S-domain formalising
the YSP scenario, where $\Gamma = AD_{YSP} \cup T_{YSP}$. The following is part of
one of the coherent models of $D_{YSP}$:

$\{[0,1]load, \neg[0]loaded, [1]loaded, [0]alive, [1]alive,$

$[1,2]wait, [1,2]da_{\neg loaded}, \neg[2]loaded, [2]alive,$

$[2,3]shoot, \neg[3]loaded, [3]alive\}$,

which corresponds to one of the anomalous models of this scenario (the one
pointed out by Hanks and McDermott). □

But it is not desirable to admit the occurrence of an event when there is no 
evidence for it. Thus we need to minimise the set of action occurrences in a given 
action theory. 

Definition 20 Let $D$ be an S-domain. A coherent S-model $I$ of $D$ is a prioritised
minimal model (or simply PMM) of $D$ iff there does not exist any coherent S-model
$I'$ of $D$ such that $Occ^{I'} \subset Occ^{I}$. ■
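
The minimisation in Definition 20 compares occurrence sets by strict set inclusion. As a small illustration, assuming each candidate model is represented only by its set of `(start, action, end)` occurrence triples (a hypothetical encoding), the minimal candidates can be selected as follows.

```python
def minimal_by_inclusion(occurrence_sets):
    """Keep the sets that have no strict subset among the candidates."""
    candidates = [frozenset(s) for s in occurrence_sets]
    return [s for s in candidates
            if not any(other < s for other in candidates)]


intended = {(0, "load", 1), (1, "wait", 2), (2, "shoot", 3)}
# The anomalous model adds an unexplained dummy action during the wait.
anomalous = intended | {(1, "da_not_loaded", 2)}
print(minimal_by_inclusion([anomalous, intended]))
```

The anomalous model is discarded because its occurrences strictly include those of the intended one. Two candidates with incomparable occurrence sets would both be kept.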

Note that the above model-theoretic minimisation policy isn’t based on the
frame assumptions. This solution to the frame problem is thus amenable to well-known
techniques such as circumscription^, but we believe an argumentation-theoretic
approach is not only more direct but has wider applicability. In order to
provide the connection between the above (model-theoretic) minimisation policy
and the (argumentation-theoretic) notion of plausible sets of assumptions we need
to maximise the set of assumptions satisfiable in a PMM.

^ In combination with the introduction of occurrences of dummy actions.

Definition 21 Let $D$ be an S-domain. A PMM $I$ of $D$ is a canonical prioritised
minimal model (or simply CPMM) of $D$ iff there does not exist any PMM $I'$ of
$D$ such that $\Delta^{I}_{QF} \subset \Delta^{I'}_{QF}$. ■

We now want to see how the account of plausible sets of assumptions connects 
to this account of minimality. 

Theorem 1 Let $D$ be an S-domain. If $I$ is a CPMM of $D$ then $\Delta^I_{QF}$ is plausible.

We now prove that not only can we derive a plausible set of assumptions 
from a given CPMM but we can also construct CPMMs from a plausible set of 
assumptions of a given S-domain. 

The set of $\Delta$-relativised models of an S-domain $D$ is denoted as $Mod^{\Delta}(D)$.

Observation 2 Let $D$ be an S-domain and $\Delta$ a set of assumptions of $D$. For
each $I \in Mod^{\Delta}(D)$, $\Delta = \Delta^I_{QF}$. □

Theorem 2 Let $D = \langle \mathcal{L}, \mathcal{R}, AB, \Gamma \rangle$ be an S-domain and $\Delta \subseteq AB$. $\Delta$ is plausible
iff $Mod^{\Delta}(D) \neq \emptyset$ and for each $I \in Mod^{\Delta}(D)$, $I$ is a CPMM of $D$. □

Theorem 3 Let $D$ be an S-domain. Furthermore, suppose that $CPMM(D)$ is
the set of CPMMs of $D$ and $Plaus(D)$ is the set of plausible sets of assumptions
of $D$. Then $CPMM(D) = \bigcup_{\Delta \in Plaus(D)} Mod^{\Delta}(D)$. □

4.2 Solving the Qualification Problem (in the Presence of the 
Frame Problem) 

The results reported in the previous section are established in a simple setting. 
If we add the following observation to the theory in example 1: [3]alive, i.e. after 
the shoot action, the victim is still alive, then like most existing formalisms, the 
above account of plausibility would come up with a contradiction. In fact, it 
would be more reasonable that such a failure is explained as an occurrence of 
some (dis-)qualification. In this section, we remove certain restrictions on the 
qualifications of actions in order to achieve a more general framework. 

There are some subtleties in the way action theories are represented in our 
proposed assumption-based framework. Note first that there is a potential diffi- 
culty if frame assumptions and qualification assumptions are not distinguished, 
which can be illustrated by a version of the YSP. Consider the following action 
description: 

$$\left\{ \frac{[T]alive,\ [T]FA_{alive}}{[T+1]alive},\quad \frac{[T]loaded,\ [T,T+1]shoot,\ [T,T+1]AQ_{alive}}{\neg[T+1]alive \wedge \neg[T]FA_{alive}} \right\} \subseteq \mathcal{R}$$

$$\{[0]loaded,\ [0]alive,\ [0,1]shoot\} \subseteq \Gamma$$

From this, we have (at least) two stable sets of assumptions: one contains
the frame assumption $[0]FA_{alive}$, which rejects the qualification assumption
$[0,1]AQ_{alive}$, and another contains $[0,1]AQ_{alive}$, which attacks
$[0]FA_{alive}$. Only the latter is intuitive in this case, but we do not have
any explicit criterion to prefer one over the other.

Given the presence of several kinds of assumptions, i.e. frame and
qualification, we will adopt the following convention: we will write
$Lr_{P}(\Delta)$ instead of $(Lr(\Delta))_{P}$ for $P \in \{FA, AQ\}$. Since we
no longer exclude qualification assumptions from our assumption-based domain
descriptions, we will simply refer to assumption-based domain descriptions as
Q-domains.

Definition 22 Let $D = (\mathcal{L}, \mathcal{R}, Ab, \Gamma)$ be a Q-domain. A
presumable set of assumptions $\Delta \subseteq Ab$ is semi-Q-plausible iff
$Lr_{FA}(\Delta)$ is minimal (with respect to set inclusion). ■

Definition 23 Let $D = (\mathcal{L}, \mathcal{R}, Ab, \Gamma)$ be a Q-domain. A
set of assumptions $\Delta \subseteq Ab$ is Q-plausible iff (a) Δ is
semi-Q-plausible, (b) $\Delta_{AQ}$ is maximal, i.e. there does not exist any
$\Delta' \subseteq Ab$ such that Δ' is semi-Q-plausible and
$\Delta_{AQ} \subset \Delta'_{AQ}$, and (c) $\Delta_{FA}$ is maximal relative to
the above two conditions, i.e. there does not exist any $\Delta' \subseteq Ab$
such that Δ' satisfies the above two conditions and
$\Delta_{FA} \subset \Delta'_{FA}$. ■

We will now refer to models of a Q-domain as Q-models. A coherent Q-model 
is a Q-model which is coherent. We minimise the set of action occurrences in 
coherent Q-models of a given action theory. 

Definition 24 Let D be a Q-domain. A coherent Q-model I of D is a prioritised
minimal Q-model (or simply PMQM) of D iff there does not exist any coherent
Q-model I' of D such that $Occ^{I'} \subset Occ^{I}$. ■

Definition 25 Let D be a Q-domain. A PMQM I of D is a canonical prioritised
minimal Q-model (or simply CPMQM) of D iff (a) there does not exist any
PMQM I' of D such that $AQ^{I} \subset AQ^{I'}$, and (b) there does not exist
any PMM I' of D such that $FA^{I} \subset FA^{I'}$. ■

Now we can proceed to results for CPMQMs regarding Q-plausible sets of 
assumptions which are similar to those for CPMMs regarding plausible sets of 
assumptions. 

Theorem 4 Let D be a Q-domain. If I is a CPMQM of D then Agp is Q- 
plausible. □ 

Similar to the previous section, we now prove that not only can we derive a
plausible set of assumptions from a given CPMQM but we can also construct
CPMQMs from a plausible set of assumptions of a given domain description.

The set of Δ-relativised models of a Q-domain D is denoted as $Mod^{\Delta}(D)$.

Observation 3 Let D be a Q-domain and Δ a set of assumptions of D. For each
$I \in Mod^{\Delta}(D)$, Δ = A^Qp. □

Theorem 5 Let $D = (\mathcal{L}, \mathcal{R}, Ab, \Gamma)$ be a Q-domain and
$\Delta \subseteq Ab$. Δ is Q-plausible iff $Mod^{\Delta}(D) \neq \emptyset$ and
for each $I \in Mod^{\Delta}(D)$, I is a CPMQM of D. □

Theorem 6 Let D be a Q-domain. Furthermore, suppose that CPMQM(D)
is the set of CPMQMs of D and $Plaus_{Q}(D)$ is the set of Q-plausible sets of
assumptions of D, then
$CPMQM(D) = \bigcup_{\Delta \in Plaus_{Q}(D)} Mod^{\Delta}(D)$. □

Q-plausible sets of assumptions allow one to handle scenarios in which
expectation failures (or qualification surprises) arise, e.g. shooting a turkey
with a loaded gun and then observing that the turkey is still alive. When such
surprises arise, the reasoner knows what is to blame: the qualification
assumptions. The reasoner can then remove the “guilty” assumptions accordingly.



5 Related Work 



The frame problem has been addressed in numerous research papers formalised 
under various frameworks for reasoning about actions, including the Situation 
Calculus (see [17]), the Event Calculus (see [19]), a temporal logic introduced 
by Sandewall (see [18]), the action language family (see [8]), and the Fluent
Calculus (see e.g. [20]). Attempts to solve the original version of the
qualification problem (in contrast to the narrowed version of this problem as
introduced by Ginsberg and Smith [7] and Lin and Reiter [11]) include Kvarnström
and Doherty’s work in tackling the qualification problem in a version of the
temporal logic introduced by Sandewall. The solution proposed in this work,
however, is still largely disconnected from the solutions to other problems of
reasoning about actions such as the frame and the ramification problems. A more
uniform solution to the qualification problem, in accordance with other accounts
of reasoning about actions, is introduced by Thielscher [21] for the Fluent
Calculus. The solution proposed by Thielscher is based on a monotonic solution
to the frame problem. The idea behind Thielscher’s solution to the frame problem
is similar to the idea behind the STRIPS problem solver. The fluents that hold
in a state are manipulated by rules that add (resp. delete) certain fluents
(literals) from the preceding state in order to obtain the resulting state. On
the other hand, the solutions to the ramification problem and the qualification
problem rely on causal expressions. The idea is to exploit the directional
character of causal expressions to eliminate the unintended models (also known
as the anomalous models).
The solution to the qualification problem is non-monotonic while the solution 
to the ramification problem remains monotonic. Thielscher’s argument in favour 
of this approach is largely due to the fact that minimisation of abnormalities in 
the traditional way as originally performed by McCarthy under circumscription 
[14] leads to anomalous models. However, as pointed out by Baker [1], a clever 
minimisation policy will overcome the problem. For a more formal analysis of
the related issues from a system-theoretic point of view, the reader is referred
to the paper by Foo et al. [6].
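
The STRIPS-style state update mentioned above can be sketched as follows. This is only an illustration of add and delete lists over a set of fluents, not Thielscher’s Fluent Calculus axiomatisation; the `Action` type, the `apply_action` function, and the shooting example are assumptions made for the sketch.

```python
from typing import FrozenSet, NamedTuple


class Action(NamedTuple):
    name: str
    preconditions: FrozenSet[str]  # fluents that must hold beforehand
    add: FrozenSet[str]            # fluents made true by the action
    delete: FrozenSet[str]         # fluents made false by the action


def apply_action(state: FrozenSet[str], action: Action) -> FrozenSet[str]:
    """Return the successor state; fluents not mentioned simply persist."""
    if not action.preconditions <= state:
        # Design choice for this sketch: an inapplicable action changes nothing.
        return state
    return (state - action.delete) | action.add


# Hypothetical encoding of the shooting scenario.
shoot = Action(
    name="shoot",
    preconditions=frozenset({"loaded"}),
    add=frozenset(),
    delete=frozenset({"alive"}),
)

before = frozenset({"alive", "loaded"})
after = apply_action(before, shoot)
print(sorted(after))  # "loaded" persists, "alive" has been deleted
```

Because every fluent that is not explicitly deleted carries over, persistence needs no separate frame assumption in this scheme, which is why such a solution is monotonic.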

Our solution is distinct from the above approaches in the sense that it
offers a solution to the major problems of reasoning about actions in a uniform
manner. With the introduction of explicit assumptions and the use of reasonable
arguments, only intended models should emerge and allow the reasoner to arrive 
at correct conclusions about the dynamic world. 

6 Conclusion

We have developed a uniform framework for reasoning about action using an
argumentation-theoretic approach (more precisely, an assumption-based approach)
in a series of papers. The present paper is the first of this series, in which
we have presented how our framework copes with the frame and the qualification
problems in a simple setting without indirect effects or domain constraints. We
have shown how our framework can be naturally extended to become more and more
expressive.

Abstract. There is growing agreement that dialogue management is
critical to speech enabled applications. This paper describes a novel ap- 
proach to knowledge acquisition in the natural language processing do- 
main, and shows the use of techniques from cognitive task analysis to 
capture politeness protocols from a “dialogue expert.” Acknowledging 
the importance of intentions in mixed initiative systems, our aim was to 
use an off-the-shelf Belief, Desire, and Intention (BDI) framework from 
Agent Oriented Software to provide the planning component, and intro- 
duce plan library cards as a means of capturing expertise in this context. 



1 Introduction 

Being able to hold a conversation with a computer has been a dream of AI 
research from the very beginning when Turing proposed what has become known 
as the Turing Test. It turned out to be harder than expected, and in this year 
when HAL was to be on his way to Jupiter, the GUI is still the primary means of 
interfacing to a computer, and call centres employ people to answer telephones. 

Two things have changed in recent years that make dialogue more attractive 
as a research area. First, with the rise of the call centres, there is more research 
funding available, not only for speech recognition, but also for the software that 
decides what to say, and when to say it. Second, the research community now 
accepts there will be no silver bullet, and that a working AI system will require 
a concerted effort by a team of people doing sometimes dull things. 

The work described here is part of an ongoing project to create a conver- 
sational agent using the beliefs, desires, and intentions (BDI) architecture in- 
troduced by Rao and Georgeff [1]. BDI systems fall within a long tradition in 
AI of modelling human decision making by selecting plans from a plan library 
to match current goals. Populating that library is a key issue, and this paper 
describes our approach to this task. 

M. Brooks, D. Corbett, and M. Stumptner (Eds.): AI 2001, LNAI 2256, pp. 532-544, 2001.
© Springer-Verlag Berlin Heidelberg 2001

For the last ten years the natural language processing (NLP) community has
been using corpus analysis as its primary data acquisition tool. This approach
collects a large body of naturally occurring text, and then uses tools such as
statistical models [2,3] or sequential analysis [4] to infer things about text
in general.
In this paper we introduce a different approach to knowledge acquisition. We use 
a technique called Cognitive Task Analysis (CTA) in which a subject matter ex- 
pert (SME) is interviewed to discover their thought processes while performing 
a task. Similar techniques have been used to populate the rule bases of expert 
systems [5,6] and Cognitive Work Analysis has been used to develop software 
agents for system simulation [7]. Mitchard [8] used Cognitive Task Analysis to 
create BDI models of human decision making in the air operations domain, and, 
following on from Mitchard, we use Applied Cognitive Task Analysis [9,10] to 
elicit knowledge from our dialogue expert. 

The task of our SME, whom we will call KT, was to take bookings for company
cars by telephone. Booking cars is, naturally, of little direct use to the Australian
Defence Forces and, like many other tasks, is more conveniently done with a GUI. 
This particular task should be seen as the pilot study for a more useful embodied 
conversational agent performing data access on behalf of decision makers. 

As far as dialogue is concerned, we find that expressions like “OK,” “Yea,” 
“I see,” and “Really” not only ground knowledge in the shared space, but can 
also fulfill the goal of encouraging the other party to say more. This technique 
is, we claim, key to KT’s strategy for being polite.

2 Background — The BDI Architecture 

Beliefs, Desires, and Intentions have long been used as a framework for
embedded systems. Bratman’s original aim [11] was to describe resource-bounded
decision making. Architectures based on his writing provide a way to balance
planning and reactive behaviour. They provide a model of making decisions with
partial knowledge of the environment, and with insufficient time to make the 
best decision. 

Since it was first introduced, the BDI approach has found a niche in the 
software agent community. Two common themes in the definition of “agent” are 
autonomy, which suggests goal driven behaviour, and a separation between the 
agent entity and its environment. The environment, being outside the control of 
the agent, provides inputs to which the agent may want to react. BDI is designed 
to pursue goals while at the same time exploiting opportunities as they arise. 

A second reason BDI is closely linked with software agents is that, like 
SOAR [12], it is a candidate model of human cognition. For many years Air 
Operations Division at DSTO have been using BDI agents to implement the hu- 
man element in simulations [13]. Such simulations involve classic software agents 
with a complex task, and programming them is non-trivial. Domain experts are 
often brought in to verify the behaviour of agents, and these SMEs tend to find 
the BDI scripts intuitively clear. Why? One explanation is that the BDI ap- 
proach explicitly models how we humans think others think. It can be seen as 
an implementation of the folk psychological view that a rational agent will do 
what it believes is in its interests. This understanding is so ingrained in us
humans that it is often difficult to see why it is interesting, hard, or even
useful [14]. Using Dennett’s example [15], seeing two children tugging at a toy,
we know they
both want it. We reason about other minds in terms of mental attitudes, and 
the BDI approach attempts to capture models of decision making at that level 
of abstraction. When pilots look at a BDI plan in Air Operations simulations, 
what they see makes intuitive sense because it describes what they would expect 
another pilot to do. Writing agents in terms of BDI utilises our inbuilt human 
ability to understand, in a common-sense way, other people’s behaviour. 



3 Background — Dialogue 

Probably the most infamous dialogue system is Weizenbaum’s Eliza [16]. This 
system was implemented using quite a simple procedure: the text is read and
inspected for the presence of a keyword. If such a word is found, the sentence is
transformed in accordance with a rule associated with the keyword; if not,
a content-free remark or an earlier transform is retrieved. The text from this
retrieval or transform is then printed out as the reply. Since 1966, when Eliza
first appeared, there would appear to have been general agreement in the AI
community that, although an interesting curiosity, the technique Weizenbaum
used did not bear on the nature of dialogue. Although pattern/action rules could
implement a Rogerian psychologist, that role was seen as simply an interesting 
exception with little relevance to more general skills that would allow a machine 
to, for instance, book cars. 
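
The keyword procedure described above can be sketched in a few lines. The rules, the content-free remark, and the way earlier transforms are stored are hypothetical simplifications chosen for illustration; they are not Weizenbaum’s original script.

```python
import re
from typing import List

# Hypothetical keyword rules: a pattern and a reply template per keyword.
RULES = [
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "Why do you say you are {0}?"),
    (re.compile(r"\bmy (mother|father)\b", re.IGNORECASE),
     "Tell me more about your {0}."),
]
CONTENT_FREE = "Please go on."


def reply(text: str, memory: List[str]) -> str:
    """Transform the input if a keyword rule matches, else fall back."""
    cleaned = text.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(cleaned)
        if match:
            response = template.format(*match.groups())
            memory.append(response)  # remember the transform for later reuse
            return response
    if memory:
        return memory.pop(0)         # retrieve an earlier transform
    return CONTENT_FREE              # otherwise a content-free remark


memory: List[str] = []
print(reply("I am tired.", memory))
print(reply("Hmm", memory))
print(reply("Hmm", memory))
```

Note that both fallbacks hand the initiative straight back to the user, which is the property of Eliza that becomes relevant later in the paper.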

Much of the work on dialogue since then has concentrated on text genera- 
tion. This kind of dialogue is often described as goal driven and is known as 
discourse planning. Consider writing this text. As authors, we have a goal to 
convince the reader of something and some plans and sub-plans on how to do 
it. The text planning process can be modelled as a hierarchical set of goals that 
bottoms out with the production of words on the page. Dialogue, by contrast, 
involves multiple agents who can interrupt and block each other’s goals. It has 
the added complexity of continual plan failure and re-evaluation — something 
BDI was explicitly developed to handle. Research on the interactive nature of di- 
alogue includes work on the way the “common ground” is developed between the 
participants [17], the nature and role of obligation, and what Allen calls
“practical dialogue” systems [18]. Research on the latter emphasizes the way people
use language to cooperatively solve problems. This is seen as not only practi- 
cal, but also significantly simpler to achieve than general human conversational 
competence. The work described here falls squarely in this last camp. 

As mentioned above, the primary tool of the NLP community is corpus analy- 
sis. In the case of dialogue, a popular approach is for researchers to use sequential 
analysis [4] and mark up transcripts with dialogue moves [19,20], or rhetorical 
devices [21]. This is the set of dialogue moves from a research project for a major 
Telco: 



REQUEST-SERVICE, OFFER-SERVICE, EXPRESS-PROBLEM, ASK-DETAILS,
CHECK, ACCEPT-REQUEST, REFUSE-REQUEST, GIVE-DETAILS,
CORRECT-INFO, ECHO, ACKNOWLEDGE, PARDON, HOLD, FULFILL, SOCIAL,
UNCLASSIFIABLE

Although these types of speech act may seem straightforward, the reliability of
the mark-up process still raises questions about the validity of many such tag
sets. Better classifications and more effective training and instruction manuals
are a hot research topic.

Child’s plan #176
goal: eat
precondition: near mum
trigger: hungry
actions:
    tell mum “I’m hungry”
    get her to approve
    ask her what I can have
    if I like it, continue
    else post goal “eat chocolate”
    get it
    eat it

Fig. 1. The outline of one of a child’s plans for getting food.

Probably the theory of dialogue structure that comes closest to a BDI
approach is that of dialogue as dialogue games [22,23,24].

Here is an example from Mann [23] introducing dialogue games: 

1 I’m hungry. 

2 Did you do a good job on your geography homework? 

3 Yeah. What’s to eat? 

4 Let me read it. What is the capital of Brazil? 

5 Rio de Janeiro. 

6 Think about it. 

7 It’s Brasilia. Can I eat now? 

8 I’ll let you have something later. What is the capital of Venezuela? 

9 Caracas. 

10 Fine. 

11 So what can I eat?

12 You want some cereal? 

13 Sure. 

14 O.K. 

In this dialogue between a mother and child, the child’s desire to eat is only 
satisfied after mum has checked the homework. At line 1 the child instantiates 
a plan, something along the lines of that in Figure 1, with the goal of eating. 
At line 2 Mum has a different goal: to check the child’s homework. At line 3 the 
child tries to stonewall Mum’s question, and continues on with her plan. Mum is 
having none of that, and continues the “check homework” game. At line 7 there 
is evidence that the child has a plan to wear Mum down: the strategy is that
if the child asks often enough, Mum will get sick of saying no. At line 8 Mum
explicitly tells her that the wear-Mum-down game is not going to work (“I’ll
let you have something later”) and at line 9 the child has abandoned that plan. 
With line 10, Mum is indicating that her plan to check homework is finished and 
the child returns to her plan to get something to eat. 
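
A plan such as the one in Figure 1 maps naturally onto a small record with a goal, a trigger, a precondition, and a list of actions. The sketch below is an assumption about how such a plan library might be represented and queried; the `Plan` class and the `applicable` function are hypothetical and do not reflect the API of any particular BDI toolkit.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Plan:
    name: str
    goal: str                                # what the plan achieves
    trigger: str                             # event that makes it relevant
    precondition: Callable[[Set[str]], bool]  # test against current beliefs
    actions: List[str]                       # steps, kept as plain text here


def applicable(plans: List[Plan], event: str, beliefs: Set[str]) -> List[Plan]:
    """Plans whose trigger matches the event and whose precondition holds."""
    return [p for p in plans if p.trigger == event and p.precondition(beliefs)]


# The child's plan from Figure 1, transcribed into the hypothetical structure.
child_plan = Plan(
    name="Child's plan #176",
    goal="eat",
    trigger="hungry",
    precondition=lambda beliefs: "near mum" in beliefs,
    actions=[
        'tell mum "I\'m hungry"',
        "get her to approve",
        "ask her what I can have",
        'if I like it, continue; else post goal "eat chocolate"',
        "get it",
        "eat it",
    ],
)

library = [child_plan]
print([p.name for p in applicable(library, "hungry", {"near mum"})])
print([p.name for p in applicable(library, "hungry", set())])
```

In the dialogue above, line 1 corresponds to this plan being selected; Mum’s competing “check homework” game would be a second entry in her own library.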

Dialogue games, in contrast to dialogue moves, are explicitly goal based, 
longer term, and succeed, fail or are abandoned. Dialogue games are consequently 
not as explicitly “in” the text, and coding schemes that mark up intentions of 
the speaker have been found to be unreliable. Rather than looking for games 
in transcripts, we introduce the idea of explicitly asking a “dialogue expert” 
about the dialogue games they use. Before looking at the study however, it is 
informative to consider exactly what it is the study intends to achieve. 



3.1 Mixed Initiative and Politeness 

Mixed initiative is often seen as the “Holy Grail” in the quest for better dialogue 
systems. Our primary premise is that a BDI architecture will provide the control 
structure to enable a mixed initiative dialogue. The concept of a dialogue game 
describes what the required BDI plans would look like, and ACTA provides the 
tools to populate the plan library. It is still not clear what kind of thing we are 
looking for however. In human to human conversations, why does initiative shift 
from one participant to another? When can a participant propose a new goal 
and when are they obliged to stick with the current one? The hypothesis is that 
politeness is a key motivation in initiative shift in human dialogue. Politeness is 
not just a matter of saying please and thank you. Brown and Levinson in their 
seminal work [25] list 30 or so universal strategies for maintaining the “face” 
of conversants. Interestingly many of these strategies are goal based and so, for 
instance, if a conversant expresses a desire for X, positive face can be expressed 
by the other person if they also consider X desirable. 

The importance of getting politeness right is perhaps demonstrated by the 
Microsoft Paper Clip. It goes without saying that Mr. Clippit is (was) not popular,
but on examination it appears to work quite well as a mechanism for accessing 
the Microsoft help system. So why the user reaction? One explanation is that it 
is not playing the social games we expect rational agents to play. On reflection, 
it appears that the Microsoft Paper Clip is annoying rather than ineffectual. 

If user satisfaction is a product of both effectiveness and social skills, it is in- 
structive to consider whether social skills can compensate for poor effectiveness. 
Evidence from our study suggests this is the case. The car booking scenario can
be seen as a slot filling task (in the cases where the caller wanted to book a
car; see below) in which the aim of the conversation is to fill in a form with five
or so slots: name, destination, time, duration, and contact details. One measure 
of effectiveness in this context is the proportion of data provided by the caller 
that makes it into the appropriate slot. KT's error rate can be measured as the 
number of times the caller provides a piece of data that KT does not pick up, 
divided by the number of pieces of slot fill data provided. Going through the 
transcripts, it turns out that she misses 20% of the data callers provide. Keep
in mind that KT was approached for these experiments because she is
recognised as being good at her job, and although user satisfaction was not
explicitly measured, there seems little doubt people were happier dealing with
KT than they would have been working with a machine with a 20% fail rate.^ This has
significant consequences for organisations that want to improve user satisfaction 
with their speech enabled systems. 
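
The error-rate measure defined above (pieces of slot data the operator fails to pick up, divided by the pieces of slot data the caller provides) can be written down directly. The event encoding and the sample values below are invented for illustration only and are not the study’s data.

```python
from typing import List, Tuple

# Each event is one piece of slot data offered by the caller, together with a
# flag saying whether the operator picked it up (hypothetical encoding).
SlotEvent = Tuple[str, bool]


def error_rate(events: List[SlotEvent]) -> float:
    """Missed pieces of slot data divided by all pieces provided."""
    if not events:
        raise ValueError("no slot data was provided")
    missed = sum(1 for _slot, picked_up in events if not picked_up)
    return missed / len(events)


# Invented example: four pieces of data offered, one of them missed.
events = [
    ("name", True),
    ("time", False),        # the caller says a time that is not picked up
    ("time", True),         # the caller repeats it and it is picked up
    ("destination", True),
]
print(error_rate(events))
```

Counting events rather than slots means that a value the caller has to repeat counts as a miss the first time, which matches the definition in the text.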



4 A BDI Model of KT 



We wanted to look at KT booking cars over the phone as a pilot study for an 
intelligent assistant project in the Division. Given time limits, car bookings were 
not going to give enough samples from our Division alone, and so we approached 
Electronic Warfare Division for assistance. As a carrot we promised a carton of 
beer (funded from our own pockets) for the Division that made the most phone 
calls. The beer becomes important. 

A separate recording telephone was installed in KT’s office and an email was
sent to both Divisions asking people to ring that number to book cars rather
than doing it through the existing Outlook calendar. Over two weeks there were 25 calls, 2
of which were taken by a stand-in operator while KT was away. 

KT was told that the aim of the exercise was to look at politeness and that 
she would be interviewed after the data was collected to see if we could identify 
her goals and procedures, and what cues she used to select them. The tapes 
were transcribed; the following is a transcript (with names changed) of one of
the more successful calls, which gives a feeling for the car booking process.
When looking at transcripts, bear in mind that a dialogue that seems perfectly
natural and comprehensible when spoken can appear quite awkward when transcribed.



 1 Morning ITD KT speaking
 2 Morning KT, it’s PD again
 3 Hello, how are you? [laughter]
 4 Can I book the car for 10 o’clock again please?
 5 Yes, which one was it that you like?
 6 Okay, ZKJ292
 7 292. Um for 10.30?
 8 No, no. 10 till 12
 9 10 till 12. And is it to go to the same place?
10 Yes, same place. Elex Adelaide
11 Not a problem. I’ll put it in.
12 Thanks for that KT
13 Okay, thank you, bye
14 Bye



As usual in AI, the straightforward cases are not interesting; it is the
exceptions that require common sense, and that is where AI systems let us down.

^ The reason KT misses data is, of course, the limitations of human memory and
attention when trying to use Outlook and hold a conversation at the same time.
Computers do not have these limitations.

4.1 The Knowledge Elicitation Process 

Of the various tools under the ACTA banner, it seemed inappropriate to use
the Knowledge Audit probes. Dialogue management skills are primarily skills
we humans do not need to think about when we use them, and so there seemed
little point in asking KT about what she would basically regard as “obvious.”
Using the transcripts and preliminary interview data, Das and Wallis used their
“naive” understanding of dialogue to produce a Task Diagram overview of the task
and to identify the cognitively interesting components of the task. Figure 2
provides the sub-tasks that help frame the car booking dialogue process.





Fig. 2. The Task Diagram for booking-a-car dialogues. 





Task diagrams bear a strong resemblance to state transition diagrams, which
have been used by some to represent the structure of dialogue for a particular
application. Although at this level of description there is a natural order to the
sub-tasks, elaborating on the nature of the add-booking-details sub-task reveals
no such restriction.

The next stage in the analysis was to use techniques from the Critical De- 
cision Method (CDM) and ask KT why she did things when she did, and to 
identify her goals when performing some action, her procedures for achieving 
goals, and the cues she used to initiate procedures and goals. These issues are 
explored in the context of a “story” and the transcripts provided the context for 
the interview. In effect the approach was naturalistic observation with supple- 
mental interviews. Phase one was to go through the transcripts and make a first 
pass at the BDI plans that would implement the necessary dialogue games for 
the car booking task. Given a set of plans, we could then interview KT using 
probes for CDM to check and develop the model. 

4.2 An Interesting Transcript 

Going through the transcripts, the very first call caused problems with identi- 
fying the goal. It was from a person who had already booked a car but rang 
anyway. We suspect the caller was after the beer, but KT (being nice) thinks 
he just wanted to help with the experiment. 

Before looking at the transcript, keep in mind that KT is expecting callers 
who want to book a car. Figure 3 shows what happens. 
 1 Good afternoon ITD, KT speaking
 2 Oh good afternoon, I have booked a car for tomorrow, a divisional car;
 3 [Right](1)
 4 [I](1) have to ring you here?
 5 Yes
 6 So I booked a COMMS division car, ZKJ292 for 9.30 till 12.00
 7 9.30 to 12.00
 8 We are going to Adelaide
 9 And it was the ZKJ?
10 Yeah. 292
11 292. And what was your name?
12 Ah PD
13 Right and your extension number?
14 97313
15 97313. Um did you want to just wait while I um check that it’s available?
16 I have booked it [not clear] I did this this afternoon before I got the message
17 [laughter] Okay
18 Okay
19 Not a problem
20 That’ll be okay?
21 Thank you
22 Okay, thank you KT
23 Yes, bye
24 Bye



Fig. 3. Transcript No. 1 — the caller wants the beer. 



What is happening in this conversation? Has KT not heard the past tense in
the caller’s opening statement? According to our model, what is going on here
is that KT has no plan that fits the situation. The initial view was that
she was simply going with the plan she had, and getting the details in order to 
make a booking - a booking she knew, at some level, she was not going to have 
to make. 

There were other cases where the model did not fit neatly with the tran- 
scripts, but this paper concentrates on this particular case as it is the most 
general, and demonstrates how we used CTA in the context of dialogue. 



4.3 The Interview 

The interview threw a new light on the situation. We used probes similar to 
those in O’Hare et al. [26]. Looking at the transcripts, KT was asked things like 
“What were your specific goals when you said this?” and “What else might you 
have said at this point?” 

When asked what was going on in the transcript in Figure 3, she said that 
she was thinking “Oh no! what am I going to do here!” She pointed out that she 
was aware the car was already booked and that indeed she had used the past 
tense on line 9. There was no intention to get all the details for a car booking,
and even when pressed she would not state an actual goal that would fit with
the Dialogue Games approach. So what motivated her responses? If she had 
decided to go with the plan she had, shouldn’t she have been able to say as 
much? One might posit subconscious goals, but that would not be in keeping 
with using BDI as a model of cognition. It seemed that KT uses BDI for goal 
based behaviour, but when all else fails, she has a plan — enabled and disabled 
by the BDI mechanism — that simply fills in and encourages the caller to say 
more. In the same way as Eliza hands the initiative back to the user, it seems 
KT’s goal, for her first 2 or 3 responses at least, is simply to encourage. 



At some point in this dialogue (about line 9, perhaps) she has developed
a new plan to add to her plan library. Here is a call, the next day, from
someone from ITD who is also after the beer:



1 Good afternoon ITD KT speaking
2 G’day, my name’s AD, I’m also in ITD, over in
3 Oh yes
4 Um, we’ve just booked a car
5 Right
6 And ah we got that e-mail, so, uh can we do that ah [laughter] terrible thing?
7 [laughter] Um yeah. Can I just go through it with you and just check that
  you’ve got it booked okay?
8 Yep, sure
9 Is that alright? Um which car were you, did you just book?



Some time between line 9 of the first call, and this call, KT has created a 
plan to confirm someone’s booking if they have already made a booking with 
Outlook, but ring up anyway. 

We conclude that a key mechanism for human dialogue is the ability to hand 
initiative back to the other person and simply encourage the other person to say 
more. Eliza’s success relied upon exploiting this social protocol to the hilt. In a 
BDI model of dialogue, one plan — in fact the default plan for when a goal is 
not identified — should be to encourage the user to say more. 
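
The proposal that the default plan should be to encourage the user can be sketched as a fallback in plan selection. The cue words, plan names, and the `respond` function are assumptions made for illustration; only the encouraging expressions are taken from the introduction.

```python
import itertools
import re
from typing import Iterator, Optional

# Hypothetical cue words that identify a caller's goal.
CUES = {
    "book_car": ("book",),
    "cancel_booking": ("cancel",),
}
# Encouraging expressions mentioned in the introduction.
ENCOURAGERS = ("OK", "Yea", "I see", "Really")


def identify_goal(utterance: str) -> Optional[str]:
    """Return the first goal whose cue word occurs as a whole word."""
    tokens = set(re.findall(r"[a-z']+", utterance.lower()))
    for goal, words in CUES.items():
        if any(word in tokens for word in words):
            return goal
    return None


def respond(utterance: str, encouragers: Iterator[str]) -> str:
    """Run the plan for the identified goal, or fall back to encouraging."""
    goal = identify_goal(utterance)
    if goal is None:
        return next(encouragers)  # default plan: hand the initiative back
    return f"[run plan: {goal}]"


encouragers = itertools.cycle(ENCOURAGERS)
print(respond("Can I book the car for 10 o'clock again please?", encouragers))
print(respond("So I booked a COMMS division car", encouragers))
print(respond("We are going to Adelaide", encouragers))
```

Matching on whole words means that “booked” does not trigger the booking plan, so the second utterance falls through to the encourage plan, much as KT’s early responses in Figure 3 do.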

Figure 4 shows a caller ringing KT’s stand-in, PP, to cancel a booking. At line 7
PP has no idea what to do with the caller and, we propose, is simply encouraging 
the caller to say more. Similarly at lines 13 and 15. Once again, at line 9, there 
is a tendency to go with whatever plan is even partly appropriate, but it is not 
clear how this would generalize. In this case PP is likely to have a plan with a 
strong link between the cue of a registration number being said, and bringing up 
the appropriate Outlook entry. There is also a very low cost to doing this, and 
there is also a tendency for people to want more information. All of these may 
contribute to PP apparently going with the book-a-car plan when the caller 
obviously doesn’t intend to. 

 1 Good morning, customer service point, PP speaking
 2 Oh, um I’m ringing for KT actually
 3 Yes, KT is
 4 Car bookings, yeah
 5 Yep, I can take that for you
 6 Okay, fine. I’ve just had a car out
 7 Yep
 8 A CD car, ZKJ292
 9 One moment, I’ll just bring that up. Sorry, the car number was?
10 Ah ZKJ292
11 Yep, and your name was?
12 Ah PD. I’m back from Adelaide now, so the car can be reused, like.
13 Okay?
14 Okay
15 Yep
16 Okay I didn’t need it as long as I thought
17 Righty oh
18 Okay, thanks
19 Thank you for letting us know
20 Bye-bye

Fig. 4. KT’s stand-in using the encourage strategy.

Here is a case where KT cannot recognize the destination, and uses the
encourage technique:

1 ... and where were you going to be going?
2 Ah the, it’s called the UWB facility
3 UWB
4 Yeah
5 Facility
6 Which is on the RAAF Base., and also be going to store 2
7 Okay and do you know where the keys are for the car?



Imagine a more direct approach — popular in computer interfaces — that 
“helpfully” suggests the known options: 

1  ... and where were you going to be going?
2  Ah the, it’s called the UWB facility
3  The available options are ...



There can be no doubt KT's approach is dramatically more polite. 



4.4 Knowledge Representation 

Having analysed the data, the conclusions from the analysis need to be written
down. This is what Klein refers to as “knowledge representation”, and Militello
and Hutton [9] recommend using a cognitive demands table to sort through and
analyse the data. For each situation, the table lists the cues and strategies used
by the expert, and the common errors a novice might make. In our case the
target BDI architecture requires that cues and strategies be associated with
procedures. To this end we introduce Plan Library Cards (PLCs), which map
directly into BDI plan structures. The use of cards was inspired by experience
with CRC cards as used in the software engineering community. Figure 5 shows
some of the more obvious PLCs for the car booking task. Each card represents a



Goal: takeCall
Cue: phone rings

Goal: bookCar
Cue: caller says s/he wants to book a car

Goal: enterName
Cue: have Outlook open && name slot empty

Goal: enterDestination
Cue: Outlook open && destination empty

Fig. 5. Four example Plan Library Cards (PLCs) for the car booking task.



procedure; the goal it might achieve; and the cues which determine when it can 
be used. Note that using the BDI approach, multiple plans might be relevant at 
any instant but only one is used, and that a procedure can fail or be abandoned 
at any point — there is no guarantee of completion. 

To walk through a transcript, the cards are grouped by goal. When the 
speaker adopts a goal, the appropriate pile of cards becomes active. Each active 
pile is then searched for a card with matching cues, and the procedure is exe- 
cuted. That is, in our case, things are said and subsidiary goals are posted. As 
new cues are discovered, either by looking at the transcript or by interview, they 
are added to the appropriate card. New cards can be introduced as required and 
the process repeated until a satisfactory description of the dialogue process is 
obtained. 
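The card walk-through described above can be sketched in a few lines of Python. This is only an illustrative sketch: the class name PlanCard, the function select_plan, the belief keys, and the procedure steps are all hypothetical (the procedure steps on the original cards are not reproduced here), and the fallback card simply implements the default "encourage the other person to say more" plan discussed earlier.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Beliefs = Dict[str, object]

@dataclass
class PlanCard:
    goal: str                          # goal the procedure might achieve
    cue: Callable[[Beliefs], bool]     # when the card can be used
    procedure: List[str]               # things to say and subgoals to post (hypothetical steps)

# Default plan when no card matches: hand the initiative back to the caller.
ENCOURAGE = PlanCard("*", lambda b: True, ["say: Yep"])

# Goals and cues follow the four example cards; the steps are placeholders.
LIBRARY = [
    PlanCard("takeCall", lambda b: bool(b.get("phone_rings")),
             ["say: greeting", "post: helpCaller"]),
    PlanCard("bookCar", lambda b: bool(b.get("caller_wants_booking")),
             ["post: enterName", "post: enterDestination", "say: confirm"]),
    PlanCard("enterName",
             lambda b: bool(b.get("outlook_open")) and b.get("name") is None,
             ["say: ask name"]),
    PlanCard("enterDestination",
             lambda b: bool(b.get("outlook_open")) and b.get("destination") is None,
             ["say: ask destination"]),
]

def select_plan(goal: str, beliefs: Beliefs, library: List[PlanCard] = LIBRARY) -> PlanCard:
    """Search the pile of cards for the active goal and return the first card whose cue matches."""
    for card in library:
        if card.goal == goal and card.cue(beliefs):
            return card
    return ENCOURAGE
```

For example, select_plan("enterName", {"outlook_open": True}) returns the enterName card, while an unknown goal such as "cancelBooking" falls through to the encourage card.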

Once the analysis is complete the next step is to apply it. Although it would 
have been nice to implement a phone based car booking system as a demon- 
strator, we did not have the appropriate resources to do this. We have however 
been working on the parts of the system that would be portable to other do- 
mains. One such component is a Java Speech API [28] based implementation of 
dialogue which allows “barge-in” statements like those seen in the car booking 
transcripts. Turning PLCs into an operational system is straightforward using
Agent Oriented Software’s product Jack [29] and marrying Jack to the speech
system is under way.

Footnote: Class-Responsibility-Collaboration (CRC) cards. See UML Distilled [27, pp. 64-66] for details.

5 Conclusion 

This paper introduces the use of techniques from Cognitive Task Analysis for 
knowledge elicitation in the context of BDI systems for dialogue. Intentions are 
explicitly modelled in a BDI approach, but intentions are hard to capture with 
more conventional corpus techniques. 

We found that one strategy our SME uses is to encourage the other person 
to say more. It is used when our expert has no plan for furthering shared goals. 
Such a strategy is more polite than those currently in use in human computer 
interfaces, and as such would appear to be able to improve user satisfaction 
independently of system effectiveness.

Abstract. Lazy Bayesian Rules modifies naive Bayesian classification
to undo elements of the harmful attribute independence assumption. It
has been shown to provide classification error comparable to boosting
decision trees. This paper explores alternatives to the candidate elimina-
tion criterion employed within Lazy Bayesian Rules. Improvements over
naive Bayes are consistent so long as the candidate elimination criterion
ensures there is sufficient data for accurate probability estimation.
However, the original candidate elimination criterion is demonstrated to
provide better overall error reduction than the use of a minimum data
subset size criterion.

Keywords: machine learning 



1 Introduction 

Naive Bayes [4] is a simple and efficient approach to classification learning that 
has clear theoretical motivation and support. It has been demonstrated to pro- 
vide competitive prediction error to more complex learning algorithms [8,11], 
especially when training set sizes are small [17]. 

Lazy Bayesian Rules (LBR) [17,18] modifies naive Bayes, seeking to retain 
its simplicity, efficiency, and clear theoretical foundations, while weakening the 
attribute independence assumption that can reduce naive Bayes’ prediction ac- 
curacy. LBR has been demonstrated to provide prediction accuracy comparable 
to boosting decision trees [18]. 

This paper describes naive Bayes and LBR. It then examines one of the com- 
ponents of LBR, the candidate elimination criterion by which LBR determines 
whether an attribute should be a candidate for factoring out of the attribute in- 
dependence assumption. Experiments demonstrate that improvements over naive 
Bayes are consistent so long as the candidate elimination criterion ensures there 
is sufficient data for accurate probability estimation. The original candidate elim- 
ination criterion is demonstrated to be better at determining when to stop than 
the use of a minimum data subset size criterion. 



M. Brooks, D. Corbett, and M. Stumptner (Eds.): AI 2001, LNAI 2256, pp. 545—556, 2001. 
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2 Naive Bayes 



Naive Bayes is motivated as follows. When classifying an instance
$X = x_1, x_2, \ldots, x_n$, whose class $y$ is unknown, classification error
will be minimized by selecting

$$\operatorname{argmax}_y \, P(y \mid X) \qquad (1)$$

the class that is most probable given $X$. A problem arises where $P(y \mid X)$ is to be
estimated from the frequencies of $X$ and $y$ in a set of data
$\mathcal{D} = (X_1, y_1), (X_2, y_2), \ldots, (X_k, y_k)$. In the limit, when the dataset
contains the entire domain with respect to which probabilities are to be determined,

$$P(W) = F(W) \qquad (2)$$

where $F(W)$ is the frequency with which $W$ occurs in $\mathcal{D}$. As
$P(W \mid Z) = P(W \wedge Z)/P(Z)$, $P(y \mid X)$ might be estimated by the approximation

$$P(y \mid X) \approx \frac{F(y \wedge X)}{F(X)} \qquad (3)$$



However, in many cases $X$ and $y \wedge X$ will not occur frequently enough in the
data for accurate estimation of the probabilities from the frequencies. In fact,
unless the set of data is very comprehensive, $X$ and $y \wedge X$ may not occur at all.
In this context, Bayes rule

$$P(y \mid X) = \frac{P(y)\,P(X \mid y)}{P(X)} \qquad (4)$$

may be used to derive alternative probabilities, by estimation of which the target
probability can be estimated. As $P(X)$ is invariant across different values of $y$,

$$P(y \mid X) \propto P(y)\,P(X \mid y) \qquad (5)$$



and hence we need not estimate the denominator. However, this still leaves the
problem of estimating $P(X \mid y)$ when $y \wedge X$ does not occur frequently in the
data. By making the conditional independence assumption

$$P(x_1, x_2, \ldots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y) \qquad (6)$$

$P(X \mid y)$ can be estimated by estimation of each $P(x_i \mid y)$, the latter estimates being
more reliable as each conjunct is likely to occur with relatively high frequency.

Naive Bayes is classification using (1), estimating $P(y \mid X)$ by (4) and (6).
As (1) minimizes prediction error, naive Bayes will minimize prediction error
except in so far as the conditional independence assumption is violated and the
estimation from data of probabilities $P(y)$ and $P(x_i \mid y)$ is inaccurate.
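Equations (1), (5), and (6) translate into a short classifier. The following Python sketch is illustrative only: the function names train_nb and predict_nb are hypothetical, attributes are assumed to be nominal with no missing values, the class probability uses a Laplace estimate and the attribute probabilities an m-estimate with m = 2 (the estimates this paper names later for LBR), and the uniform prior 1/(number of observed values) inside the m-estimate is an assumption of this sketch. Ties keep the first class seen rather than being resolved at random.

```python
from collections import Counter, defaultdict
from math import log

def train_nb(data):
    """data: list of (x, y), where x is a tuple of nominal attribute values and y a class."""
    class_counts = Counter(y for _, y in data)
    value_counts = defaultdict(Counter)   # (attribute index, class) -> counts of values
    values = defaultdict(set)             # attribute index -> values seen in training
    for x, y in data:
        for i, v in enumerate(x):
            value_counts[(i, y)][v] += 1
            values[i].add(v)
    return class_counts, value_counts, values

def predict_nb(model, x, m=2.0):
    """Return argmax_y P(y) * prod_i P(x_i | y), computed in log space."""
    class_counts, value_counts, values = model
    n, k = sum(class_counts.values()), len(class_counts)
    best, best_score = None, float("-inf")
    for y, n_y in class_counts.items():
        score = log((n_y + 1) / (n + k))                  # Laplace estimate of P(y)
        for i, v in enumerate(x):
            prior = 1.0 / max(len(values[i]), 1)          # assumed uniform prior
            score += log((value_counts[(i, y)][v] + m * prior) / (n_y + m))  # m-estimate
        if score > best_score:                            # ties keep the first class seen
            best, best_score = y, score
    return best

# Toy data, invented for illustration only.
toy = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
       (("rain", "mild"), "yes"), (("rain", "cool"), "yes")]
toy_model = train_nb(toy)
```

Working in log space avoids underflow when many attribute probabilities are multiplied; it does not change which class maximizes the product.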

However, while the conditional independence assumption makes the estima-
tion of $P(X \mid y)$ feasible, and naive Bayes delivers competitive classification per-
formance for small data sets, the independence assumption is likely to be violated
for many real world classification tasks. Notwithstanding Domingos & Pazzani’s
[3] observation that such violations are harmless so long as they do not affect
the relative rank of each estimate of $P(y \mid X)$, research into semi-naive Bayesian
learning has demonstrated that such violations are frequent and that explicit
actions to alleviate their effect can reduce error [6,7,9,10,11,12,13,14,16].



3 Lazy Bayesian Rules 



LBR utilizes an alternative to Bayes theorem (4),

$$P(y \mid Z_1 \wedge Z_2) = \frac{P(y \mid Z_2)\,P(Z_1 \mid y \wedge Z_2)}{P(Z_1 \mid Z_2)} \qquad (7)$$

The derivation of this equality is given in Zheng & Webb [17]. Given that
$P(Z_1 \mid Z_2)$ is invariant across values of $y$,

$$P(y \mid Z_1 \wedge Z_2) \propto P(y \mid Z_2)\,P(Z_1 \mid y \wedge Z_2). \qquad (8)$$

Where $Z_1$ is a conjunction of terms, $Z_1 = z_1 \wedge z_2 \wedge \cdots \wedge z_m$, a conditional attribute
independence assumption

$$P(Z_1 \mid y \wedge Z_2) = \prod_{i=1}^{m} P(z_i \mid y \wedge Z_2) \qquad (9)$$

can be used to estimate $P(Z_1 \mid y \wedge Z_2)$.

Like naive Bayes, LBR estimates $P(y \mid X)$ for each $y$, selecting the $y$ that
maximizes the estimate. LBR differs from naive Bayes by segmenting the con-
juncts of $X$ into two groups, $Z_1$ and $Z_2$, and then using (7) in place of (4) and (9)
in place of (6). Like naive Bayes, LBR will minimize classification error except
in so far as its independence assumption is violated and the estimation of the
required probabilities is incorrect.

A principal advantage of LBR over naive Bayes is that its independence
assumption is weaker. Whereas naive Bayes assumes independence between all
conjuncts given the class, LBR assumes independence only between the conjuncts
in $Z_1$ given both the class and the conjuncts in $Z_2$.

The assumption of independence between fewer attributes is an advantage 
as fewer attribute interdependencies will be assumed incorrectly. 

The assumption of independence under stronger conditions is also a major
advantage. Consider the conditions age > 70, senile, and nocturia. Each of
these three conditions will be highly interdependent with the others, as senility
and nocturia are both correlated with age. However, given age > 70, senile and
nocturia may be independent, as the interdependence of senility and nocturia
may solely result from the respective interdependencies with age. That is, while
$P(\mathit{senile} \wedge \mathit{nocturia}) \neq P(\mathit{senile})\,P(\mathit{nocturia})$,
$P(\mathit{senile} \wedge \mathit{nocturia} \mid \mathit{age} > 70) = P(\mathit{senile} \mid \mathit{age} > 70)\,P(\mathit{nocturia} \mid \mathit{age} > 70)$.
If this is the case (and conditioning on $y$ does not produce independence between
these attributes),

$$P(y \mid \mathit{age} > 70 \wedge \mathit{senile} \wedge \mathit{nocturia}) \neq \frac{P(y)\,P(\mathit{age} > 70 \mid y)\,P(\mathit{senile} \mid y)\,P(\mathit{nocturia} \mid y)}{P(\mathit{age} > 70 \wedge \mathit{senile} \wedge \mathit{nocturia})} \qquad (10)$$

so naive Bayes will be inaccurate. However, LBR may be accurate because

$$P(y \mid \mathit{age} > 70 \wedge \mathit{senile} \wedge \mathit{nocturia}) = \frac{P(y \mid \mathit{age} > 70)\,P(\mathit{senile} \mid \mathit{age} > 70 \wedge y)\,P(\mathit{nocturia} \mid \mathit{age} > 70 \wedge y)}{P(\mathit{senile} \wedge \mathit{nocturia} \mid \mathit{age} > 70)}. \qquad (11)$$



If these two advantages were the only consideration, it would be advantageous
to factor out all conditional interdependencies by placing all attributes in $Z_2$.
However, placing an attribute in $Z_2$ carries one disadvantage in addition to its
advantages. Each conditional probability $P(z_i \mid y \wedge Z_2)$ will be estimated by the
approximation $P(z_i \mid y \wedge Z_2) \approx F(z_i \wedge y \wedge Z_2)/F(y \wedge Z_2)$. The more attributes
in $Z_2$ the lower the frequency in $\mathcal{D}$ of both $z_i \wedge y \wedge Z_2$ and $y \wedge Z_2$ and hence
the lower the expected accuracy of the approximation. Hence, LBR engages in
a process of seeking to balance gains in expected accuracy due to factoring out
harmful attribute interdependencies against losses in expected accuracy due to
decreased expected accuracy of estimation of the required parameters.

LBR manages this trade-off by performing leave-one-out cross-validation once
for each attribute-value using the conditional formula that results from including
that value in $Z_2$. An attribute-value $v$ is only considered as a candidate if the
number of examples misclassified by including $v$ in $Z_2$ but correctly classified
by excluding it is significantly lower than the number correctly classified by
including it but misclassified by excluding it. A matched-pair binomial sign test
with significance level 0.05 is used to assess significance. The candidate with the
lowest error is selected and the process repeated until no candidates remain.

LBR uses lazy learning. Calculation is performed when an object is to be
classified. Only the attribute-values of that object are considered for inclusion
in $Z_2$. The algorithm is presented in Table 1. Note that this algorithm does not
explicitly maintain $Z_2$. Each $A_{best}$ found is added to $Z_2$. $Z_1$ is the values of the
attributes in $Att$ for $E_{test}$. $Z_2$ is the remaining attribute values for $E_{test}$. The
effect of factoring out $Z_2$ is achieved by selecting for $D_{training}$ the subset of
instances that satisfy the conditions in $Z_2$. When the probability of an attribute
value conditional on a class is estimated from a training set, the m-estimate [2]
with m = 2 is used. When the probability of a class is estimated, the Laplace
estimate [2] is used. When applying naive Bayesian classification, if two or more
classes obtain equal highest probability estimates, one is selected at random.
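The selection loop described above can be sketched compactly in Python. This is a simplified illustration, not the authors' implementation: the function names are hypothetical, attributes are assumed nominal with no missing values, the m-estimate uses an assumed uniform prior over the values observed in the current training subset, ties between classes keep the first class seen instead of being resolved at random, and the leave-one-out evaluation is computed naively, so it is only suitable for small data.

```python
from collections import Counter
from math import comb, log

def nb_predict(train, attrs, x, m=2.0):
    """Naive Bayes over attribute indices `attrs`; `train` is a list of (values, label)."""
    classes = Counter(y for _, y in train)
    n, k = len(train), len(classes)
    best, best_score = None, float("-inf")
    for y, n_y in classes.items():
        score = log((n_y + 1) / (n + k))                       # Laplace estimate of P(y)
        for a in attrs:
            n_values = len({xi[a] for xi, _ in train})          # assumed uniform prior 1/n_values
            n_match = sum(1 for xi, yi in train if yi == y and xi[a] == x[a])
            score += log((n_match + m / n_values) / (n_y + m))  # m-estimate with m = 2
        if score > best_score:                                  # ties keep the first class seen
            best, best_score = y, score
    return best

def loo_errors(ids, train, attrs):
    """Leave-one-out outcome per example id: True where the held-out example is misclassified."""
    out = {}
    for i in ids:
        rest = [train[j] for j in ids if j != i]
        x, y = train[i]
        out[i] = nb_predict(rest, attrs, x) != y
    return out

def sign_test_p(wins, losses):
    """One-tailed binomial sign test: P(at most `losses` losses in wins + losses fair trials)."""
    n = wins + losses
    return sum(comb(n, i) for i in range(losses + 1)) / 2 ** n

def lbr_classify(train, x_test, alpha=0.05):
    attrs = list(range(len(x_test)))         # attributes still in Z_1
    ids = list(range(len(train)))            # current D_training, as indices into train
    errors = loo_errors(ids, train, attrs)
    while True:
        best = None                          # (total errors, attribute, subset ids, subset errors)
        for a in attrs:
            sub = [i for i in ids if train[i][0][a] == x_test[a]]
            rest = [b for b in attrs if b != a]
            temp = dict(errors)              # outcomes outside the subset are unchanged
            temp.update(loo_errors(sub, train, rest))
            wins = sum(errors[i] and not temp[i] for i in ids)
            losses = sum(temp[i] and not errors[i] for i in ids)
            total = sum(temp.values())
            if ((best is None or total < best[0])
                    and wins > losses and sign_test_p(wins, losses) < alpha):
                best = (total, a, sub, {i: temp[i] for i in sub})
        if best is None:
            break                            # no candidate passes the elimination test
        _, chosen, ids, errors = best
        attrs = [b for b in attrs if b != chosen]
    return nb_predict([train[i] for i in ids], attrs, x_test)

# XOR-style data (invented): the class depends on the two attributes jointly.
xor_train = [((a, b), a ^ b) for a in (0, 1) for b in (0, 1)] * 10
```

On the XOR-style data, naive Bayes over both attributes cannot separate the classes, whereas conditioning on the test example's first attribute value leaves a subset in which the second attribute determines the class.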



4 Alternative Candidate Elimination Strategies 

LBR eliminates from consideration as candidates for $A_{best}$ attribute values that
fail to reduce error by a statistically significant amount using leave-one-out
cross-validation on the training data. The condition that enforces this strategy,
the test that TempErrors is significantly lower than Errors, is set in bold type
in Table 1.

Table 1. The Lazy Bayesian Rule learning algorithm

LBR(Att, D_training, E_test)
INPUT:  Att: a set of attributes,
        D_training: a set of training examples described using Att and classes,
        E_test: a test example described using Att.
OUTPUT: a predicted class for E_test.

LocalNB = a naive Bayesian classifier trained using Att on D_training
Errors = errors of LocalNB estimated using N-CV on D_training
Cond = true
REPEAT
    TempErrors_best = the number of examples in D_training + 1
    FOR each attribute A in Att whose value v_A on E_test is not missing DO
        D_subset = examples in D_training with A = v_A
        TempNB = a naive Bayesian classifier trained using Att - {A} on D_subset
        TempErrors = errors of TempNB estimated using N-CV on D_subset +
                     errors from Errors for examples in D_training - D_subset
        IF ((TempErrors < TempErrors_best) AND
            (TempErrors is significantly lower than Errors))
        THEN
            TempNB_best = TempNB
            TempErrors_best = TempErrors
            A_best = A
    IF (an A_best is found)
    THEN
        Cond = Cond AND (A_best = v_A_best)
        LocalNB = TempNB_best
        D_training = D_subset corresponding to A_best
        Att = Att - {A_best}
        Errors = errors of LocalNB estimated using N-CV on D_training
    ELSE
        EXIT from the REPEAT loop
classify E_test using LocalNB
RETURN the class

This approach was motivated by the desire to eliminate from consideration 
attribute values for which factoring out appears to reduce error only by chance. 
Inevitably different formulae will result in variability in prediction performance, 
and by chance some will perform better than others. By eliminating candidates 
for which the difference in performance was not significantly greater than the 
baseline performance, we reduce the risk of selecting an attribute value that 
appears to improve performance only by chance. By using leave-one-out
cross-validation classification performance as the selection criterion we aimed
to measure the effect of both the improvement brought about by weakening the
attribute independence assumptions and the decrease in accuracy of estimation
brought about by decreased data.

Our previous experiments indicate that this strategy is very effective at man-
aging this trade-off and results in very strong classification performance [17,18].
However, an alternative argument can be constructed that as the only harm in
moving an attribute-value to $Z_2$ lies in the reduction in accuracy of estimation of
the parameters, the candidate elimination strategy should be aimed directly at
combating this problem. In other words, an attribute-value should remain a can-
didate for inclusion in $Z_2$ so long as there is sufficient data to reliably estimate
the required parameters.

This paper tests this proposal by substituting for the LBR candidate elimi- 
nation test (set in bold type in Table 1) an alternative test that is based solely 
on the number of examples in Dtraining that have the relevant value. This is 
predicated on the assumption that if there are sufficient examples of a given 
value, estimation of the frequency of that value and the probability of each class 
given that value will be sufficiently accurate for accurate classification. Three 
values are considered, 30, 100, and 500. The first value, 30, was selected as 30 is 
commonly held to be the minimum sample from which one should draw statis- 
tical inferences. The last value, 500, was selected as a sufficiently large number 
that accurate estimation of parameters should be possible. 100 was selected as 
an intermediate value. This new strategy was implemented by substituting the
condition $|D_{subset}| > MinSize$ for the candidate elimination condition set in
bold type in Table 1, where MinSize was set respectively to 30, 100, and 500.
This approach will default to naive Bayes when the dataset size is less than
MinSize as all candidates will be eliminated.



5 Experiments 

The first experiment compared naive Bayes and the four variants of LBR (the original
candidate elimination criterion, called hereafter LBR, and candidate elimination
using MinSize set to each of 30, 100, and 500, called hereafter MinSize = 30,
MinSize = 100, and MinSize = 500, respectively). The experiment used the 29 datasets
from the UCI repository [1] that have been used in previous LBR experiments
[17,18] (a selection based on those used in prior semi-naive Bayesian learning
research). These datasets are described in Table 2. The experimental method of
[18] was replicated: ten repetitions of three-fold cross-validation, with different
random selection of folds during each repetition. Numeric attributes were dis-
cretized using Fayyad & Irani’s [5] MDL discretization algorithm on the training
data for a given fold. Each algorithm was evaluated with the same sequence of
thirty training and test set pairs formed in this manner.
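The evaluation protocol above (ten repetitions of three-fold cross-validation, giving thirty training and test set pairs shared by all algorithms) can be sketched as follows. The function name repeated_cv_splits and the fixed seed are assumptions of this sketch; the paper does not say whether folds were stratified, so plain random folds are used here.

```python
import random

def repeated_cv_splits(n_examples, folds=3, repetitions=10, seed=0):
    """Return a list of (train_indices, test_indices) pairs, one per fold per repetition."""
    rng = random.Random(seed)              # fixed seed so every algorithm sees the same pairs
    splits = []
    for _ in range(repetitions):
        order = list(range(n_examples))
        rng.shuffle(order)                 # a different random fold assignment per repetition
        for f in range(folds):
            test = order[f::folds]         # every folds-th example forms one fold
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            splits.append((train, test))
    return splits

# For a dataset of 690 examples (the size of the credit screening data),
# each of the thirty pairs has 460 training and 230 test examples.
credit_splits = repeated_cv_splits(690)
```

Generating the splits once and reusing them for every algorithm is what makes the per-dataset comparisons matched.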

The average error rates of each algorithm for each data set are presented 
in Table 3. Also presented for each algorithm is the mean error across all data 
sets, the geometric mean error ratio compared with naive Bayes, the win/loss 
record between the algorithm and naive Bayes, and the win/loss record between 
the algorithm and LBR. The mean error is a very gross measure of performance 
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Table 2. Description of data sets 



                                          No. of  No. of Attributes
Domain                            Size   Classes  Numeric  Nominal
Lung cancer                         32         3        0       56
Labor negotiations                  57         2        8        8
Postoperative patient               90         3        1        7
Zoology                            101         7        0       16
Promoter gene sequences            106         2        0       57
Echocardiogram                     131         2        6        1
Lymphography                       148         4        0       18
Iris classification                150         3        4        0
Hepatitis prognosis                155         2        6       13
Wine recognition                   178         3       13        0
Sonar classification               208         2       60        0
Glass identification               214         6        9        0
Audiology                          226        24        0       69
Heart disease (Cleveland)          303         2       13        0
Soybean large                      307        19        0       35
Primary tumor                      339        22        0       17
Liver disorders                    345         2        6        0
Horse colic                        368         2        7       15
House votes 84                     435         2        0       16
Credit screening (Australia)       690         2        6        9
Breast cancer (Wisconsin)          699         2        9        0
Pima Indians diabetes              768         2        8        0
Annealing processes                898         6        6       32
Tic-Tac-Toe end game               958         2        0        9
LED 24 (noise level = 10%)        1000        10        0       24
Solar flare                       1389         2        0       10
Hypothyroid diagnosis             3163         2        7       18
Splice junction gene sequences    3177         3        0       60
Chess (King-rook-vs-king-pawn)    3196         2        0       36



as error rates on different domains are incommensurable, but provides an ap- 
proximate indication of relative performance. The geometric mean error ratio is 
the geometric mean of the value for each data set of the error of the algorithm 
divided by the error of naive Bayes. The geometric mean is more appropriate 
than the mean as an aggregate measure of ratio values [15]. The win/loss records 
with respect to naive Bayes and LBR list the number of domains for which the 
error of the algorithm is lower than the error of, respectively, naive Bayes and 
LBR. 
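The two summary statistics defined above are simple to compute. The sketch below is illustrative: the function names are hypothetical, both error lists are assumed to be in the same dataset order, and every error rate is assumed to be strictly positive (a zero error would make the logarithm undefined).

```python
from math import exp, log

def geometric_mean_ratio(errors, baseline_errors):
    """Geometric mean over datasets of (algorithm error / baseline error)."""
    ratios = [e / b for e, b in zip(errors, baseline_errors)]
    return exp(sum(log(r) for r in ratios) / len(ratios))

def win_loss(errors, baseline_errors):
    """Count datasets where the algorithm's error is lower (win) or higher (loss); draws are ignored."""
    wins = sum(e < b for e, b in zip(errors, baseline_errors))
    losses = sum(e > b for e, b in zip(errors, baseline_errors))
    return wins, losses
```

A ratio of 0.5 on one dataset and 2.0 on another gives a geometric mean of 1.0, whereas the arithmetic mean of the same ratios would be 1.25; this symmetry is why the geometric mean suits ratio values.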

The first point of interest is that LBR has scored slightly fewer wins and
slightly more losses with respect to naive Bayes than in previous experiments
[17,18]. However, it is notable that all of LBR’s losses to naive Bayes occur with
smaller data sets. The largest is credit screening, containing 690 examples, and
for which the training set size in three-fold cross-validation will be 460. It is also
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Table 3. Error rates 



                                                       MinSize
                                     NB    LBR     30    100    500
Lung cancer                       0.534  0.544  0.534  0.534  0.534
Labor negotiations                0.098  0.098  0.105  0.098  0.098
Postoperative patient             0.378  0.386  0.383  0.378  0.378
Zoology                           0.059  0.059  0.063  0.059  0.059
Promoter gene sequences           0.109  0.112  0.170  0.109  0.109
Echocardiogram                    0.296  0.297  0.306  0.296  0.296
Lymphography                      0.182  0.182  0.196  0.182  0.182
Iris classification               0.066  0.066  0.065  0.066  0.066
Hepatitis prognosis               0.144  0.144  0.175  0.144  0.144
Wine recognition                  0.023  0.023  0.030  0.023  0.023
Sonar classification              0.245  0.245  0.240  0.248  0.245
Glass identification              0.238  0.237  0.240  0.246  0.238
Audiology                         0.277  0.277  0.290      -  0.278
Heart disease (Cleveland)         0.171  0.171  0.200  0.177  0.171
Soybean large                     0.143  0.101  0.149  0.115  0.143
Primary tumor                     0.534  0.535  0.568  0.551  0.534
Liver disorders                   0.361  0.363  0.359  0.355  0.361
Horse colic                       0.208  0.199  0.197  0.192  0.208
House votes 84                    0.100  0.067  0.086  0.057  0.100
Credit screening (Australia)      0.146  0.147  0.166  0.154  0.146
Breast cancer (Wisconsin)         0.026  0.026  0.041  0.034  0.026
Pima Indians diabetes             0.252  0.251  0.267  0.253  0.252
Annealing processes               0.030  0.028  0.030  0.026  0.030
Tic-Tac-Toe end game              0.295  0.185  0.145  0.220  0.295
LED 24 (noise level = 10%)        0.261  0.260  0.265  0.263  0.259
Solar flare                       0.039  0.015  0.020  0.017  0.031
Hypothyroid diagnosis             0.018  0.015  0.020  0.017  0.018
Splice junction gene sequences    0.046  0.044  0.077  0.057  0.043
Chess (King-rook-vs-king-pawn)    0.124  0.028  0.021  0.021  0.032

Mean                              0.185  0.174  0.186  0.178  0.183
Geo mean vs NB                           0.930  1.081  0.960  0.975
W/L vs NB                                 12/7   8/19   10/9    4/1
W/L vs LBR                                       8/21  10/13   9/11



notable that of the seven losses to naive Bayes, only three are by more than
0.002, a very small margin. While the win/loss record is not significant at the
0.05 level using a one-tailed binomial sign test (p=0.1796), the mean across all
data sets is substantially lower, and, more significantly, the geometric mean error
ratio strongly favours LBR. It is notable that for the largest data sets LBR is
consistently winning, halving naive Bayes’ error with respect to solar flare and
quartering it with respect to chess.

These results suggest that LBR’s candidate elimination strategy might
be suboptimal for small numbers of examples. In other words, it is credible that
the candidate elimination strategy does not take adequate account of whether
there is sufficient data for reliable estimation of the required parameters. It was
this supposition, derived from previous experiments, that motivated the current
study.

Of the three minimum example settings, it seems clear that MinSize = 30 
provides the worst performance. On all metrics it performs worse than naive 
Bayes. The geometric mean error ratio strongly favours naive Bayes as does 
the win/loss record (significantly at the 0.05 level, one-tailed binomial sign test 
p=0.0261). The win/loss record against LBR strongly and significantly favours 
LBR (p=0.0120). 
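The one-tailed binomial sign test applied to these win/loss records can be reproduced with a few lines of Python. The function name is hypothetical, and the sketch assumes that draws are discarded and that the test is taken in the direction of the side with more wins.

```python
from math import comb

def one_tailed_sign_test(wins, losses):
    """P(the minority outcome occurs at most min(wins, losses) times in wins + losses fair trials)."""
    n = wins + losses
    k = min(wins, losses)
    return sum(comb(n, i) for i in range(k + 1)) / 2 ** n

p_lbr_vs_nb = one_tailed_sign_test(12, 7)     # LBR against naive Bayes
p_min30_vs_nb = one_tailed_sign_test(8, 19)   # MinSize = 30 against naive Bayes
```

Under these assumptions the two records evaluate to approximately 0.1796 and 0.0261, in agreement with the values quoted in the text.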

The situation with respect to MinSize = 100 is less clear cut. It wins as often 
as it loses against naive Bayes. The mean, and more significantly, the geometric 
mean error ratio, both favour MinSize = 100 over naive Bayes, indicating that 
the magnitude of its wins tends to be greater than the magnitude of its losses. 
The win/loss record with respect to LBR favours the latter, but not significantly 
so (p=0.3388). 

The results with respect to MinSize = 500 appear much more straight- 
forward, however. First, it is necessary to consider the outcome for audiology. 
It might initially appear anomalous that MinSize = 500 achieves a different 
outcome to naive Bayes for a dataset with fewer than 500 examples. The expla- 
nation, however, is straightforward. For this dataset there is one classification 
during the ten sets of three-fold cross-validation for which naive Bayes scores two 
classes as equi-probable and for which the random resolution of this draw selected 
different classes for naive Bayes and MinSize = 500. In this case the random 
outcome favoured naive Bayes. Of the larger datasets, for which MinSize = 500
had the opportunity to move attribute-values to $Z_2$, MinSize = 500 consis-
tently wins over naive Bayes. Restricting the analysis to datasets for which
MinSize = 500 modifies the behaviour of naive Bayes, the win/loss record
is 4/0, which approaches significance at the 0.05 level (p=0.0625).

Table 4 presents the average size of $Z_2$ ($|Z_2|$) and the average number of
examples from which the probabilities are estimated ($|\mathcal{D}|$) for each dataset for
LBR and its three variants. It is striking that when there is sufficient data for
the constraint on minimum numbers of examples to be satisfied, this alterna-
tive approach tends to add many more values to $Z_2$. Consider, for example,
MinSize = 500 on the King-rook-vs-king-pawn data. More than three times the
number of attribute values are added to $Z_2$ even though there is not a large dif-
ference in the average number of examples selected by each $Z_2$. This is because
MinSize = 500 can keep selecting additional attribute values so long as they
cover sufficient cases while LBR requires that the selection results in a significant
reduction in error.

Of the six datasets for which MinSize = 500 is able to select attribute
values for $Z_2$, LBR obtains lower error for four and higher for two. However, for
the two for which LBR obtains higher error, the magnitude of the difference is 
very small whereas the magnitude is relatively high for those datasets for which 
LBR achieves lower error. These results suggest that the significance test in 
LBR’s candidate elimination strategy does confer an advantage. Further support

Table 4. Mean |Z2| and examples available for estimation of parameters

                                      LBR         MinSize=30     MinSize=100    MinSize=500
                                   |Z2|     |D|   |Z2|     |D|   |Z2|     |D|   |Z2|     |D|
Lung cancer                        0.07    20.7   0.00    21.3   0.00    21.3   0.00    21.3
Labor negotiations                 0.00    38.0   0.24    36.4   0.00    38.0   0.00    38.0
Postoperative patient              0.05    58.9   1.25    40.9   0.00    60.0   0.00    60.0
Zoology                            0.00    67.3   4.13    35.0   0.00    67.3   0.00    67.3
Promoter gene sequences            0.01    70.2   0.47    53.5   0.00    70.7   0.00    70.7
Echocardiogram                     0.02    87.1   1.85    49.7   0.00    88.0   0.00    88.0
Lymphography                       0.05    97.8   4.31    43.0   0.00    98.7   0.00    98.7
Iris classification                0.00   100.0   0.84    48.9   0.00   100.0   0.00   100.0
Hepatitis prognosis                0.02   102.2   4.28    36.4   0.00   103.3   0.00   103.3
Wine recognition                   0.00   118.7   0.74    86.2   0.00   118.7   0.00   118.7
Sonar classification               0.27   126.3  12.39    40.0   5.91   102.4   0.00   138.7
Glass identification               0.12   135.3   3.41    58.1   1.01   118.8   0.00   142.7
Audiology                          0.18   145.6  43.33    48.5  26.24   103.0   0.00   150.7
Heart disease (Cleveland)          0.05   175.5   3.31    47.2   1.66   128.1   0.00   180.0
Soybean large                      0.99   161.0  13.38    47.6   8.37   109.9   0.00   204.7
Primary tumor                      0.10   221.3   3.30   136.8   2.51   161.9   0.00   226.0
Liver disorders                    0.28   217.5   4.60    61.3   2.97   138.8   0.00   230.0
Horse colic                        0.47   192.4   3.59    54.1   2.01   130.5   0.00   245.3
House votes 84                     0.67   188.5   5.44    54.8   2.43   115.7   0.00   290.0
Credit screening (Australia)       0.20   425.2   4.51    84.9   3.06   160.6   0.00   460.0
Breast cancer (Wisconsin)          0.00   466.0   2.38   150.6   1.82   269.9   0.00   466.0
Pima Indians diabetes              0.23   455.3   2.83   100.0   1.76   187.2   0.00   512.0
Annealing processes                0.09   570.0   5.05   121.4   4.76   208.1   2.52   545.0
Tic-Tac-Toe end game               1.65   165.1   2.86    45.3   1.85   121.0   0.00   638.7
LED 24 (noise level = 10%)         0.50   571.1   5.11   129.8   3.54   197.8   0.50   603.9
Solar flare                        0.80   534.6   4.71   235.1   4.35   267.4   3.01   695.0
Hypothyroid diagnosis              0.28  1923.7  14.92   532.5  14.61   616.7  14.04   832.6
Splice junction gene sequences     0.39  1686.8   1.98   413.3   1.75   448.4   1.14   878.1
Chess (King-rook-vs-king-pawn)     3.67   572.5  15.62   136.2  15.30   169.2  11.28   551.7



for this conclusion is provided by a second study that compared naive Bayes,
LBR, and MinSize = 500 on five larger datasets: phoneme (5438 examples),
mush (8124), pendigits (10992), adult (48842), and shuttle (58000). As ten runs
of three-fold cross-validation were infeasible for such large data sets, leave-one-
out cross-validation was performed for 1000 randomly selected examples from
each data set. For each of these examples, each algorithm was presented all the
remaining examples in the dataset as a training set and the withheld example
was then classified. The resulting error rates are presented in Table 5. As can be
seen, both LBR and MinSize = 500 consistently achieve lower error than naive
Bayes for these larger datasets. The win/loss records of 5/0 are in both cases
statistically significant at the 0.05 level using a one-tailed sign test (p=0.0313).
While MinSize = 500 obtains marginally lower error than LBR on one dataset,
LBR obtains substantially lower error on one and slightly lower on two.
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Table 5. Error for large datasets 



Dataset 


NB 


LBR 


MinSize=500 


phoneme 


0.265 


0.215 


0.244 


mush 


0.014 


0.000 


0.000 


pendigits 


0.123 


0.028 


0.025 


adult 


0.163 


0.132 


0.137 


shuttle 


0.002 


0.000 


0.001 
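
The significance figure quoted for the 5/0 win/loss records can be checked with a short calculation. The following is a minimal sketch, assuming the usual one-tailed sign test in which each dataset is treated as a fair coin flip under the null hypothesis and ties are dropped; the function and variable names are illustrative and do not come from the paper.

```python
from math import comb


def one_tailed_sign_test(wins: int, losses: int) -> float:
    """Probability of at least `wins` wins in wins + losses fair coin flips."""
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


# Error rates copied from Table 5 as (NB, LBR, MinSize=500).
table5 = {
    "phoneme": (0.265, 0.215, 0.244),
    "mush": (0.014, 0.000, 0.000),
    "pendigits": (0.123, 0.028, 0.025),
    "adult": (0.163, 0.132, 0.137),
    "shuttle": (0.002, 0.000, 0.001),
}

# Count datasets on which LBR has lower (win) or higher (loss) error than NB.
lbr_wins = sum(lbr < nb for nb, lbr, _ in table5.values())
lbr_losses = sum(lbr > nb for nb, lbr, _ in table5.values())
p_lbr = one_tailed_sign_test(lbr_wins, lbr_losses)

# Same comparison for MinSize = 500 against NB.
ms_wins = sum(ms < nb for nb, _, ms in table5.values())
ms_losses = sum(ms > nb for nb, _, ms in table5.values())
p_ms = one_tailed_sign_test(ms_wins, ms_losses)

print(f"LBR vs NB: {lbr_wins}/{lbr_losses}, p = {p_lbr:.5f}")
print(f"MinSize=500 vs NB: {ms_wins}/{ms_losses}, p = {p_ms:.5f}")
```

With five wins and no losses the probability is 1/32 = 0.03125, which agrees with the reported p=0.0313 after rounding.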



6 Conclusions 



This paper makes two contributions to the literature on lazy Bayesian rules. 
First, it presents empirical results on much larger datasets than previously ex- 
plored, providing statistically significant support for the hypothesis previously 
advanced [17] that LBR provides consistent advantage over naive Bayes for large 
datasets. 

The primary motivation for the paper, however, was to investigate alter- 
natives to the candidate elimination criteria employed in LBR, exploring the 
hypothesis that it will never be harmful to select for inclusion candidate
attribute values that retain sufficient examples for reliable estimation of the
required parameters. While some support for this hypothesis was obtained by the
consistent capacity of MinSize = 500 to reduce error relative to naive Bayes, 
the error reduction capacity of LBR remains higher. This suggests that the sig- 
nificance test serves a useful function in implicitly assessing the relative gains 
from factoring out a harmful attribute interdependence against the losses from 
reducing the amount of data from which parameters are estimated. 

Nonetheless, the MinSize = 500 strategy may offer computational advan- 
tages in some applications. This is because the overheads of assessing how many 
training cases are selected by a candidate attribute value are very low in compar- 
ison to the computational overheads associated with performing a matched-pair 
binomial sign test. For the extremely large datasets employed in some online 
datamining applications these computational considerations may outweigh the 
error reduction capacity of the significance test strategy. 
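
To illustrate why the size-based criterion is cheap, here is a minimal sketch of the counting step. It assumes that training cases are stored as dictionaries and that a candidate attribute value is acceptable when at least `min_size` training cases share it; both the data layout and the exact comparison (`>=`) are assumptions made for illustration, and the function names are hypothetical rather than taken from the LBR software.

```python
from typing import Dict, Hashable, List

Example = Dict[str, Hashable]


def retained_count(training: List[Example], attribute: str, value: Hashable) -> int:
    """Number of training cases selected by the candidate attribute value."""
    return sum(1 for example in training if example.get(attribute) == value)


def passes_min_size(training: List[Example], attribute: str, value: Hashable,
                    min_size: int = 500) -> bool:
    """Assumed criterion: keep the candidate only if enough cases remain."""
    return retained_count(training, attribute, value) >= min_size


# Toy data: 600 red cases and 400 blue cases.
training = [{"colour": "red"}] * 600 + [{"colour": "blue"}] * 400
print(passes_min_size(training, "colour", "red"))
print(passes_min_size(training, "colour", "blue"))
```

The check is a single pass over the training cases, whereas a matched-pair sign test needs error estimates for both candidate classifiers before any comparison can be made.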



Acknowledgements. I am grateful to Zijian Zheng for developing the lazy 
Bayesian rules software that was used in these experiments.

Abstract. Intelligent agents are a powerful Artificial Intelligence technology
which shows considerable promise as a new paradigm for mainstream software 
development. However, despite their promise, intelligent agents are still scarce 
in the market place. A key reason for this is that developing intelligent agent 
software requires significant training and skill: a typical developer or undergrad- 
uate struggles to develop good agent systems using the Belief Desire Intention 
(BDI) model (or similar models). This paper identifies the concept set which 
we have found to be important in developing intelligent agent systems and the 
relationships between these concepts. This concept set was developed with the 
intention of being clearer, simpler, and easier to use than current approaches. We 
also describe briefly a (very simplified) example from one of the projects we 
have worked on (RoboRescue), illustrating the way in which these concepts are 
important in designing and developing intelligent software agents. 

Keywords: AI Architectures, distributed AI, multiagent systems, reactive control, 
software agents. 



1 Introduction 

Intelligent agents are a powerful Artificial Intelligence technology which shows considerable
promise as a new paradigm for mainstream software development. Agents offer new
ways of abstraction, decomposition, and organisation that fit well with our natural view 
of the world and agent oriented programming is often considered a natural successor 
to object oriented programming [6]. It has the potential to change the way we design, 
visualise, and build software in that agents can naturally model “actors” - real world 
entities that can show autonomy and proactiveness. Additionally, social agents naturally 
model (human) organisations ranging from business structure & processes to military 
command structures. A number of significant applications utilising agent technology [5] 
have already been developed, many of which are decidedly non-trivial. 

An intelligent agent is one which is able to make rational decisions, i.e., blending 
proactiveness and reactiveness, showing rational commitment to decisions made, and 
exhibiting flexibility in the face of an uncertain and changing environment. 

Despite their promise, intelligent agents are still scarce in the market place*. There is
a real technical reason for this: developing intelligent agent software currently requires 
significant training and skill. Our experience (and the experience of others) is that a 

* Although abuse of buzzwords is, alas, all too common. 

M. Brooks, D. Corbett, and M. Stumptner (Eds.); AI 2001, LNAI 2256, pp. 557-568, 2001.

typical developer or final-year undergraduate student struggles to develop good agent
systems using the Belief Desire Intention (BDI) model (or similar models). 

Decker [3] discusses problems that undergraduate students have in approaching agent 
oriented development. These include a lack of suitable background in AI (planning and 
goal oriented programs), poor software engineering skills, and a lack of experience at 
dealing with concurrent and distributed programming/debugging, and with communi- 
cation protocols. 

We have found the key problem to be that students cannot clearly identify the nec- 
essary pieces to break the program into and thus tend to build monolithic plans which 
try to handle all contingencies internally, rather than create an appropriate collection of 
plans which can be applied in different contexts. They also have significant difficulty
with interfacing the agent to its environment. Other reasons why developing intelligent
agent systems is difficult include:

- Immature tool support: There is a lack of good debugging tools and of tools which 
integrate an internal agent architecture with suitable middleware & infrastructure. 
Additionally, many tools are research prototypes and lack efficiency, portability, 
documentation, and/or support. 

- The need for processes and methodologies: Programmers are familiar with designing 
object oriented systems. However, the design of agent oriented systems differs in a
number of ways, for example in identifying roles, goals, and interaction patterns.

- Design guidelines and examples: Designing a collection of plans to achieve a goal 
is different to designing a single procedure to perform a function. This difference is 
fundamental - developing intelligent agents is a different programming paradigm 
and needs to be learnt and taught as such. 

- Complex concepts such as intentions are difficult to explain; this isn’t helped by a 
lack of agreement on concepts and inconsistent terminology. 

- Lack of a suitable^ text book: much of the work on intelligent agents is scattered 
across many research papers (sometimes collected into volumes). 

In the process of working on a number of agent programs, teaching students and 
assisting them to build agent programs, and developing and running workshops for 
academia and industry^, we have developed an initial process of agent design and development
for BDI systems.^ This process is explained more fully in the technical report
[8] and is still being refined and developed. 

This paper identifies the concept set which we have found to be important in de- 
veloping intelligent agent systems and the relationships between these concepts. These 
of course rely heavily on the standard Belief Desire Intention (BDI) concepts [9,10] 
though we have found it necessary to clarify some of the differences between just what 
these concepts are in the initial philosophical work [1], the logical theories [9,13,2] and 
the implementations such as PRS, dMars and JACK. We have also found it important 
to place some emphasis on the concepts of percepts and actions which appear in many 

^ Suitable: Aimed at undergraduates or professional developers and containing enough detail to
answer the question “How would I actually go about building an intelligent agent?”. 

^ Workshops have been developed and delivered in association with Agent Oriented Software.

^ Using primarily dMars from the Australian Artificial Intelligence Institute, and more recently
JACK from Agent Oriented Software.

generic models of agents (e.g. [12]) and which are very important in the interfacing of
the agent deliberation to the external environment. We have found a need to distinguish
more clearly between events and goals than is done in dMars or JACK and to provide
greater support within the execution engine for reasoning about goals than is usually 
done in BDI agent systems [14]. 

Our work in general is focussed on multi-agent systems although this paper concen- 
trates on the concepts required for the internals of each intelligent agent - a necessary 
pre-requisite for multi-agent systems with teams or societies containing such agents. 

2 Background: The BDI Model 

The BDI model [9,10] is a popular model for intelligent agents. It has its basis in philos- 
ophy [1] and offers a logical theory which defines the mental attitudes of Belief, Desire, 
and Intention using a modal logic; a system architecture; a number of implementations 
of this architecture (e.g. PRS, JAM, dMars, JACK); and applications demonstrating the 
viability of the model. The central concepts in the BDI model are: 

Beliefs: Information about the environment; informative. 

Desires: Objectives to be accomplished, possibly with each objective’s associated pri- 
ority/payoff; motivational. 

Intentions: The currently chosen course of action; deliberative. 

Plans: Means of achieving certain future world states. Intuitively, plans are an abstract 
specification of both the means for achieving certain desires and the options available 
to the agent. Each plan has (i) a body describing the primitive actions or sub-goals 
that have to be achieved for plan execution to be successful; (ii) an invocation 
condition which specifies the triggering event^, and (iii) a context condition which 
specifies the situation in which the plan is applicable. 
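
The three parts of a plan listed above can be pictured as a small data structure. This is a minimal sketch and not the representation used by PRS, dMars, or JACK; the class, the field names, and the example plans are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Beliefs = Dict[str, Any]


@dataclass
class Plan:
    name: str
    trigger: str                        # (ii) invocation condition: the triggering event
    context: Callable[[Beliefs], bool]  # (iii) context condition: when the plan applies
    body: List[str]                     # (i) primitive actions or sub-goals, in order


def applicable_plans(library: List[Plan], event: str, beliefs: Beliefs) -> List[Plan]:
    """Plans triggered by `event` whose context condition holds for `beliefs`."""
    return [p for p in library if p.trigger == event and p.context(beliefs)]


# Two hypothetical plans that respond to the same event in different contexts.
library = [
    Plan("drive_and_squirt", "new-fire", lambda b: b.get("water", 0) > 0,
         ["plan_route", "follow_route", "squirt"]),
    Plan("call_for_help", "new-fire", lambda b: b.get("water", 0) == 0,
         ["broadcast_request"]),
]

options = applicable_plans(library, "new-fire", {"water": 3})
print([p.name for p in options])
```

Keeping the context condition separate from the body is what allows several small plans to share one triggering event instead of a single monolithic plan handling every contingency internally.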

The BDI model has developed over about 15 years and there are certainly strong 
relationships between the theoretical work and implemented systems. The paper [10] 
describes an abstract architecture which is instantiated in systems such as dMars and 
JACK and shows how that is related to the BDI logic. However, the concepts we have 
found useful for development within these systems do not necessarily match the concepts 
most developed in the theoretical work. Neither are they necessarily exactly the concepts 
which have arisen within particular implemented systems such as JACK. An additional 
complication is small differences between similar concepts, such as Desires and Goals, 
which receive differing emphasis in different work at different times. 

Desires are understood to be things the agent wants to achieve. They play an important 
role in the philosophical foundations, but the logical theory deals primarily with Goals, 
which are assumed to be a consistent set of desires. At the implementation level the 
motivational concept is reduced to events - goals are implicit and the creation of a new 
goal is treated as an event which can trigger plans. Events are ignored in the theoretical 
framework although they play a key role in implementations. In the theoretical model 
plans are simply beliefs or intentions. However in the implementations plans are a central 

^ Some events are considered as goal-events.

concept. Some key differences between the philosophy, theory, and implementation
viewpoints of BDI are shown in the table below. 



Philosophy:       Belief                                 Desire    Intention
Theory:           Belief                                 Goal      Intention
Implementation:   Relational DB (or arbitrary object)    Event     Running Plan



3 Concepts for Intelligent Agents 

We describe the set of concepts which we have come to use in developing intelligent 
agent applications. We believe that these are necessary and sufficient for building the 
sort of applications appropriately approached using BDI agents, and we hope that they 
are simple, and clearly explained. The work to develop a formal semantic framework 
for these concepts, thus developing closer links between a theoretical framework and an 
implemented development platform, is work in progress. 

We build up our description of an intelligent agent by beginning with a basic, and
universally agreed upon (see for example [12]), property of agents: they are situated
(see figure 1). Thus, we have actions and percepts. Internally, the agent is making a
decision: from the set of possible actions As it is selecting an action (or actions) to
perform (a ∈ As). Loosely speaking, where the description of the agent’s internal
workings contains a statement of the form “[select] X ∈ V” then we have a decision being
made. Thus the types of decisions being made depend on the internal agent architecture.

An action is something which an agent does, such as 
move-north or squirt. Agents are situated, and an action 
is basically an agent’s ability to affect its environment. In their simplest form actions are
atomic and instantaneous and either fail or succeed. In the more general case actions can
be durational (encompassing behaviours over time) and can produce partial effects; for
example a failed move_to action may well have changed the agent’s location. In addition
to actions which directly affect the agent’s environment, we also want to consider “inter- 
nal actions”. These correspond to an ability which the agent has which isn’t structured 
in terms of plans and goals. Typically, the ability is a piece of code which either already 
exists or would not benefit from being written using agent concepts, for example image 
processing in a vision sub-system. 

A percept is an input from the environment, such as the location of a fire and an 
indication of its intensity. The agent may also obtain information about the environment 
through sensing actions. 

A decision: The essence of intelligent agents is rational decision making. There are 
a number of generic, non-application-specific questions which intelligent agents must 
answer, such as: Which action shall I perform now? Which goal do I work on now? How 
shall I attempt to realise this goal? Where shall I go now (for mobile agents)? And who 
shall I interact with (for social agents)? 




Fig. 1. Agents are Situated

Mechanisms to answer these kinds of questions are core intelligent agent processes.
They result in decisions which must fulfil rationality conditions, in that we expect that
decisions be persistent and only be revisited when there is a good reason for doing so.
It is also important that the answer to the questions not be trivial: if an agent only has a
single goal at a time and a single means of realising this goal then we have reduced the
agent to the special case of a conventional program and there is no scope for decision
making or for flexible, intelligent behaviour.

Note that although the concept of a decision is fundamental to intelligent agents, it 
is not always necessary to represent the decisions explicitly. For example, the decision 
regarding choice of goal could be represented using a “current goal” variable which is 
updated when a decision is made. 

We now consider the internal workings of the 
agent (see figure 2). We want our intelligent agents 
to be both proactive and reactive. A proactive 
agent is one which pursues an agenda over time. 

The agent’s proactiveness implies the use of goals 
and modifies the agent’s internal execution cycle: 
rather than select an action one at a time, we se- 
lect a goal which is persistent and constrains our 
selection of actions. A reactive agent is one which 
will change its behaviour in response to changes 
in the environment. An important aspect in deci- 
sion making is balancing proactive and reactive 
aspects. On the one hand we want the agent to 
stick with its goals by default, on the other hand 
we want it to take changes in the environment into account. The key to reconciling these 
aspects, thus making agents suitably reactive, is to identify significant changes in the 
environment. These are events. We distinguish between percepts and events: an event is 
an interpreted percept which has significance to the agent. For example, seeing a fire is 
a percept. This percept could give rise to a new fire event or a fire under control event
depending on history and possibly other factors. 
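
The distinction between a percept and an event can be sketched in a few lines. In this illustration the "history" is simply the last intensity recorded for each fire location, and the rule that a drop in intensity means "under control" is an assumption made for the example, not something specified in the text; all names are hypothetical.

```python
from typing import Dict, Optional, Tuple

Location = Tuple[int, int]


def interpret_fire_percept(known_fires: Dict[Location, int],
                           location: Location,
                           intensity: int) -> Optional[str]:
    """Turn a raw fire percept into an event, or None if nothing significant changed."""
    previous = known_fires.get(location)
    known_fires[location] = intensity  # remember what was seen (a belief update)
    if previous is None:
        return "new_fire"              # first sighting of a fire at this location
    if intensity < previous:
        return "fire_under_control"    # assumed rule: the fire is shrinking
    return None                        # same percept, no significance to the agent


known_fires: Dict[Location, int] = {}
first = interpret_fire_percept(known_fires, (3, 4), 5)
second = interpret_fire_percept(known_fires, (3, 4), 5)
third = interpret_fire_percept(known_fires, (3, 4), 2)
print(first, second, third)
```

The same percept (a fire at a given location) yields different events, or none, depending on what the agent already knows.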

A goal (variously called “task”, “objective”, “aim”, or “desire”) is something the 
agent is working on or towards, for example extinguish_fire or rescue_civilian. Often
goals are defined as states of the world which the agent wants to bring about; however, this 
definition rules out maintenance goals (e.g. “maintain cruising altitude”) and avoidance 
goals, or safety constraints (e.g. “never move the table while the robot is drilling”). Goals 
give the agent its autonomy and proactiveness. An important aspect of proactiveness is 
the persistence of goals: if a plan for achieving a goal fails then the agent will consider 
alternative plans for achieving the goal in question. We have found that goals require 
greater emphasis than is typically found in existing systems. It is important for the 
developer to identify the top level goals of the agent as well as subsidiary goals which 
are used in achieving main goals. Our modified execution engine does significantly more 
reasoning about goals than is usual in BDI implementations [14], including reasoning 
about interference between goals and how to select goals when it is not consistent to 
pursue them simultaneously. We differentiate between top level goals and subsidiary 




Fig. 2. Proactive Agents have Goals, Reactive Agents have Events

goals in that subsidiary goals are not important in their own right and may therefore be
treated differently in the reasoning process than top-level goals.
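
Goal persistence, as described above, means that a failed plan does not cause the goal to be abandoned. A minimal sketch of that behaviour follows; plans are modelled as callables that report success or failure, which is a simplification chosen for illustration, and every name here is hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple

PlanFn = Callable[[], bool]  # runs a plan and returns True on success


def pursue(goal: str, plans: Iterable[Tuple[str, PlanFn]]) -> Optional[str]:
    """Try alternative plans for `goal` in order; return the name of the one that worked."""
    for name, run in plans:
        if run():
            return name   # goal achieved by this plan
    return None           # every known plan failed; the goal remains unachieved


# Hypothetical plans for an extinguish_fire goal: the first fails, the second succeeds.
plans = [
    ("direct_route", lambda: False),
    ("detour_route", lambda: True),
]
print(pursue("extinguish_fire", plans))
```

A conventional procedure would simply report the first failure; here the goal stays active until a plan succeeds or the alternatives run out.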

An event is a significant occurrence. Events are often extracted from percepts, al- 
though they may be generated internally by the agent, for example on the basis of a clock. 
An event can trigger new goals, cause changes in information about the environment, 
and/or cause actions to be performed immediately. Actions generated directly by events 
correspond to “reflexive” actions, executed without deliberation. Events are important 
in creating reactive agents in that they identify important changes which the agent needs 
to react to. 

Agents in realistic applications usually have limited computational resources and
limited ability to sense their environment. Thus the auxiliary concepts of plan and
belief are needed. Beliefs are effectively a cache for perceived information about the
environment, and plans are effectively a cache for ways of pursuing goals (see figure 3).
Although both of these concepts are “merely” aids in efficiency, they are not optional.
Beliefs are essential since an agent has limited sensory ability and also it needs to
build up its knowledge of the world over time. Plans are essential for two reasons. The
first is pure computational efficiency: although planning technology and computational
speed are improving, planning from action descriptions is still incompatible with real
time decision making. The second reason is that by providing a library of plans we avoid
the need to specify each action’s preconditions and effects: all we need to provide for
an action is the means to perform it. This is significant in that representing the
effects of continuous actions operating over time and space in an uncertain world in
sufficient detail for first principles planning is unrealistic for large applications.

Fig. 3. Adding Plans and Beliefs

A plan is a way of realising a goal, for example a plan for achieving the goal 
extinguish_fire might specify the three steps: plan a route to the fire, follow the route to
the fire, and squirt the fire until it has been put out. Although the concept of a plan is 
common there is no agreement on the details. From our point of view it is not necessary to 
adopt a specific notion of a plan, rather we can specify abstractly that a plan for achieving 
a goal provides a function which returns the next action to be performed. This function 
takes into account the current state of the world (beliefs), what actions have already been 
performed, and might involve sub-goals and further plans. For computational reasons 
it is desirable for this to at least include a “library of recipes” approach, rather than 
requiring construction of plans at runtime from action descriptions. 
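
The abstract view of a plan as a function that returns the next action can be sketched for the three-step fire example. The belief keys (`fire_intensity`, `at_fire`) and the action names are assumptions introduced for this illustration only.

```python
from typing import Any, Dict, List, Optional


def extinguish_fire_plan(beliefs: Dict[str, Any], done: List[str]) -> Optional[str]:
    """Return the next action for the goal, given beliefs and the actions already performed."""
    if beliefs.get("fire_intensity", 0) == 0:
        return None               # the fire is out: nothing left to do
    if "plan_route" not in done:
        return "plan_route"       # step 1: plan a route to the fire
    if not beliefs.get("at_fire", False):
        return "follow_route"     # step 2: keep moving until the fire is reached
    return "squirt"               # step 3: squirt until the fire has been put out


print(extinguish_fire_plan({"fire_intensity": 2, "at_fire": False}, []))
print(extinguish_fire_plan({"fire_intensity": 2, "at_fire": False}, ["plan_route"]))
print(extinguish_fire_plan({"fire_intensity": 2, "at_fire": True}, ["plan_route"]))
print(extinguish_fire_plan({"fire_intensity": 0, "at_fire": True}, ["plan_route"]))
```

Because the function consults the current beliefs on every call, repeated actions such as moving or squirting fall out naturally without an explicit loop in the plan.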

A belief is some aspect of the agent’s knowledge or information about the environ- 
ment, self or other agents. For example an agent might believe there is a fire at X because 
she saw it recently, even if she cannot see it now. 

These concepts (actions, percepts, decisions, goals, events, plans, and beliefs) are 
related to each other via the execution cycle of the agent. An agent’s execution cycle
follows a sense-think-act cycle, since the agent is situated. The think part of the cycle
involves rational decision making, consisting of the following steps (depicted in figure 4):



1. Percepts are interpreted (using beliefs) to give events.
2. Beliefs are updated with new information from percepts.
3. Events yield reflexive actions and/or new goals.
4. Goals are updated, including current, new and completed goals.
5. If there is no selected plan for the current goal, or if the plan has failed, or if
   reconsideration of the plan is required (due to an event) then a plan is chosen.
6. The chosen plan is expanded to yield an action.
7. Action(s) are scheduled and performed.

Fig. 4. Agent Execution Cycle



By comparison, the BDI abstract execution cycle [10] consists of the following steps: 
(1) use events to trigger matching plans (options), (2) select a subset of the options, (3) 
update the intentions and execute them, (4) get external events, and (5) drop successful 
and impossible attitudes. The execution cycle presented here differs from the BDI exe- 
cution cycle in a number of ways including the use of reflexive actions, the derivation 
of events by interpreting percepts, the process of going from goals to plans to actions, 
and increased reasoning about goals. However, the two primary contributions are the 
role of top level goals (which are distinguished from events and from sub-goals, and are 
persistent) in achieving proactiveness, and the role of events (as significant occurrences) 
in creating suitably reactive agents. 
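
The seven steps of the execution cycle presented here can be sketched as a single `step` method. This is a deliberately reduced illustration and not the authors' execution engine: there is one kind of percept, one event, plans are functions from beliefs to the next action, and plan failure and reconsideration in step 5 are omitted. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

Percept = Dict[str, Any]
PlanFn = Callable[[Dict[str, Any]], Optional[str]]  # beliefs -> next action


@dataclass
class Agent:
    plan_library: Dict[str, List[PlanFn]]  # goal name -> candidate plans
    beliefs: Dict[str, Any] = field(default_factory=dict)
    goals: List[str] = field(default_factory=list)
    current_plan: Optional[PlanFn] = None

    def interpret(self, percept: Percept) -> List[str]:
        # Step 1: percepts are interpreted (using beliefs) to give events.
        if percept.get("fire") and not self.beliefs.get("fire"):
            return ["new-fire"]
        return []

    def step(self, percept: Percept) -> List[str]:
        events = self.interpret(percept)              # step 1
        self.beliefs.update(percept)                  # step 2: update beliefs
        actions: List[str] = []
        for event in events:                          # step 3: reflexes and new goals
            if event == "new-fire":
                actions.append("broadcast_fire")      # reflexive action
                self.goals.append("extinguish")       # new goal
        # Step 4: update goals, dropping those that are completed.
        if not self.beliefs.get("fire") and "extinguish" in self.goals:
            self.goals.remove("extinguish")
            self.current_plan = None
        # Step 5: choose a plan if the current goal has none selected.
        if self.goals and self.current_plan is None:
            self.current_plan = self.plan_library[self.goals[0]][0]
        # Step 6: expand the chosen plan to yield an action.
        if self.goals and self.current_plan is not None:
            action = self.current_plan(self.beliefs)
            if action is not None:
                actions.append(action)
        # Step 7: the returned actions are scheduled and performed by the caller.
        return actions


agent = Agent({"extinguish": [lambda beliefs: "squirt"]})
print(agent.step({"fire": True}))
print(agent.step({"fire": True}))
print(agent.step({"fire": False}))
```

The goal persists across cycles until the beliefs show that it has been achieved, while the reflexive broadcast happens only on the cycle in which the event is generated.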



4 A Case Study: RoboRescue 

One of the applications we have worked on recently is RoboRescue. We describe a 
greatly simplified version of a part of this application in order to illustrate concretely the 
concepts we have identified.

RoboRescue [11] is a long-term (50 years!) project which has the goal of creating 
robotic squads which could be deployed in the aftermath of a disaster such as an earth- 
quake. Tasks to be carried out include rescuing & evacuating people and controlling fires. 
Challenges faced include the lack of information and an environment which contains 
obstacles (including collapsed buildings, obstructed roads, fires, etc.) and potentially 
limited communication. 

The RoboRescue simulator has a number of components which simulate different 
aspects of a disaster scenario. The intention is that these aspects combine synergistically 
to provide a realistically complex and challenging environment. There are a number of
different types of agents in the system including fire engines, ambulances, civilians, and
police agents. At each simulator cycle agents receive visual information and possibly
heard information.

Due to space limitations in this paper we focus on a single agent type (fire engine)
and a simple set of its behaviours focussed on fire extinguishing.

To design the fire engine agent using the concepts described earlier we look at each 
concept and identify instances of it in the agent system. This process is, in general, 
iterative: when designing a plan we may realise that the agent needs to know a certain 
piece of information which implies the addition of a belief which might imply the 
addition (or modification) of goals and plans. A comprehensive methodology for the
detailed design of agent systems is described in [8]. Here we concentrate on illustrating
the way in which instances of the relevant concepts are identified and defined.

Decisions are not considered here since questions which an agent needs to answer as 
it runs aren’t specific to a given application domain; rather, they are specific to a given
agent architecture. Detailed design regarding how agents will work together, exactly
what plans will be used and what sub-goals will be needed is a design task well beyond
the scope of this paper and requiring the more extensive methodology of [8]. Rather, we
focus on the initial aspects of the process and the identification of some of the relevant
concepts in each class. More concept instances will inevitably be identified as the design
progresses.

Percepts: Percepts represent an interface to the environment, so are often, as in this 
case, at least partially predefined. There are two types of percepts in RoboRescue - visual
and auditory. Auditory information may be broadcast information which in the case of
a fire-engine agent may be a message from another fire-engine agent, the fire-engine
center, or a civilian either crying for help or stating that they have heard a cry for help.
It may also be a message directed specifically to that agent from another fire-engine
agent. The content of messages from fire-engine agents and the fire-engine center is an
aspect of perceptual information which is under the control of the designer and must be
decided. Visual information contains a current view of the environment including such
things as roads, buildings, fires and their intensity, etc.

Percepts must be processed to build up knowledge of the environment (beliefs) and 
to extract events. In the case of the visual information in RoboRescue it is first processed
in order to add any information to the map of the world that is built incrementally. It is
then processed to extract the position of each fire which is assessed to see whether there
is a fire-related event which should be generated and to further update the knowledge of
the environment. 

Actions: The actions which an agent can perform are also part of its interface to 
the environment. As indicated earlier, there may be additional actions defined beyond
those that affect the environment, but these are a starting point. In this case the agent can
perform the external actions of squirting which reduces the fire at the current location,
moving which moves the agent an unspecified number of steps along a given route,
telling (broadcasting) and saying (sending) a message. The move and squirt actions may
need to be applied repeatedly to achieve the desired goal, for example a fire may need
to be squirted multiple times before it is extinguished and the agent may need to move
several times before it reaches its destination.

We also identified early on the action of planning a route between two points as an
internal action. A route is a necessary parameter for the move action; code existed for
planning a route given the map, a current position and a destination, and there seemed
to be no clear advantage in using goals and plans to achieve this task, particularly given
the existing code. Thus we had an initial set of five actions, four that were part of the
external interface to the simulator and one which was internal.

Goals: An obvious major goal of the fire engine agent is to put out any fire. We also 
identified a goal of discovering fires. Additional thought yielded for us the further goals 
of assisting a team-mate to put out a fire and coordinating a team effort to extinguish a 
fire. The goals of the agent obviously have to do with the motivation for the system, but 
are not externally defined in the way that at least some of the percepts and actions are. 
Choosing the appropriate set of top-level goals is one of the early design decisions that 
need to be made. 

It is tempting to treat goals as being implied by the beliefs: any belief in the existence 
of a fire implies a goal extinguish the fire. However, this approach has a number of issues. 
Firstly, it is hard to indicate that certain fires should not be pursued (e.g. because another 
agent is dealing with them). Secondly, it is difficult to add goals which are not directly 
prompted by environmental cues. For example, the goal to find fires exists independent 
of cues in the environment, and should result in exploratory behaviour if no other goals 
are inhibiting this goal. 

In addition to identifying the goals of an agent, we need to specify when to adopt 
and drop goals (including how to recognise when they are achieved), plans for achieving
these goals, sub-goals that may be part of achieving each goal and relative priorities and 
interactions between the goals. 

As indicated in section 3, events, i.e. significant things which happen in the environ- 
ment, often result in the adoption of goals. Thus the event of receiving a message from 
a team-mate requesting help in extinguishing a fire is likely to result in a goal to assist 
that team mate to put out his fire. The event of noticing a fire is likely to result in a goal 
to extinguish that fire. Thus we move onto events. 

Events: Here we are identifying significant occurrences that are likely to make the 
agent add or delete goals, change goal priorities, or change how the agent is pursuing a 
goal. We also look for significant occurrences which affect our beliefs. Many events will 
be the result of processing percepts - e.g. extracting information about fires and their 
locations, checking these against a list of fires we already know about, and obtaining 
any “new-fire” events. Some events will also be generated internally as a result of the 
agent’s own behaviour. For example after sending a request for assistance with a fire, 
the agent may generate a “help-requested” event. As events are used to update beliefs, 
trigger plans and generate reflexive actions there is likely to be a large set of events
which are developed iteratively in the process of developing the full design. The detailed
design methodology in [8] allows for a layered identification of events in conjunction 
with incremental refinement of sets of plans.® 

The initial events that we identify typically have to do with the significant occurrences 
that will alert the agent to the need to instantiate one of its top level goals, recognise one 
of its top level goals as achieved, or indicate a need for reconsideration. 

® These sets of plans and related events are actually JACK capabilities. 
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In this example such events include 

- “new-fire” which causes instantiation of an “extinguish” goal and also causes a 
reflexive action to broadcast the existence of the fire to other fire-engine agents; 

- “fire-extinguished” event which causes the agent to recognise that a goal has been 
achieved; 

- “fire-urgent” event which indicates that a fire is growing and is larger than a given 
threshold. This indicates a need for reconsideration and in our design may lead to a goal 
to co-ordinate a team effort to extinguish the fire; 

- “help-requested” which leads to a reconsideration of current priorities and may lead 
to a goal to assist team-mates. 

Identification of events determines what interpretation or processing we need to do 
with the percepts received in order to be able to recognise events. This in turn also often 
affects what information about the environment we need to represent - e.g. to be able 
to generate a new fire event from a visual percept we need to explicitly represent which 
fires we already know about. 
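
As a rough illustration of how a “new-fire” event might be extracted from a percept, the sketch below compares the fire locations reported in a percept against the fires already believed to exist. The names `FireBeliefs` and `extract_events`, and the representation of a percept as a collection of coordinate pairs, are assumptions made for this example and are not part of the design described here.

```python
from dataclasses import dataclass, field


@dataclass
class FireBeliefs:
    """Hypothetical belief store: fire locations the agent already knows."""
    known_fires: set = field(default_factory=set)


def extract_events(percept_fires, beliefs):
    """Turn a visual percept (fire locations) into "new-fire" events.

    An event is generated only for a location that is not yet in the
    belief store; the store is updated so that the same fire does not
    raise a second event on a later percept.
    """
    events = []
    for location in sorted(set(percept_fires)):
        if location not in beliefs.known_fires:
            beliefs.known_fires.add(location)
            events.append(("new-fire", location))
    return events
```

The point of the sketch is only that recognising the event requires an explicit record of known fires; a real agent would also handle fires that disappear or change size.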

Beliefs: Beliefs are really any knowledge the agent maintains. Some of this informa- 
tion may be kept in the form of a special purpose knowledge database, other information 
may be kept in arbitrary suitable data structures. For this application the primary in- 
formation needed is an internal map of the environment including the location of fires, 
roads, buildings, etc. Updating of this data structure is part of the important processing 
of percepts. 

Beliefs tend to be used primarily in two ways. The first is in extracting events from 
percepts: to recognise a new fire we have to have knowledge about existing fires. The 
second is in determining which plan should be used to achieve a goal in a particular 
situation. Information that allows us to choose between two alternative ways of achieving 
a goal, or even to decide whether there is any way of achieving it, depends on some 
representation of beliefs. 

Some of the beliefs we have identified as important here (in addition to the map) 
are: which fires are being attended to and by whom; current priority of fires; and whether 
a route is available to a particular location (it may not be, either due to blockages, or 
insufficient information about the environment). 

Plans: Plans describe various ways for us to achieve our goals. To get the full power 
of the BDI approach it is advantageous to define simple plans, with use of sub-goals 
wherever possible. Initially there may only be one straightforward way to achieve each 
goal, but new variations can be added in a modular fashion. This allows development of 
a simple agent that manages straightforward cases first, with addition of variations for 
more complex cases afterwards. Plan sets need to be checked for coverage and overlap 
regarding the situations which can arise. 

A very simple pair of plans for achieving the goal to extinguish a fire at X would be 
one plan which simply squirts until the fire is extinguished (suitable if the agent is already 
at location X), and another plan which obtains a route to X, moves to X, then squirts until 
the fire is extinguished (suitable if the agent is not already at X). Alternatively we could 
have a plan for putting out a fire and a plan for extinguishing a fire as shown below: 

Goal: put_out_fire(position) 
    plan_route(position) 
    move(route) 
    extinguish_fire(position) 

Goal: extinguish_fire(position) 
Condition: location = position 
    squirt 
    extinguish_fire(position) unless nofire(position) 

In fact this plan is simple enough that it can be implemented reactively using a set 
of trigger-response rules; all that needs to be stored for this approach is the target’s 
coordinates. However, this approach is unable to handle sequences of actions where 
the triggers aren’t in the environment, and is unable to manage commitment. For these 
reasons plans need internal state to track what has been done, and need to be able to 
specify a sequence of actions to be done. 
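
The pair of plans for extinguishing a fire can be sketched in code as follows. This is a minimal illustration, not the JACK implementation: the agent is a plain dictionary, each plan is paired with a context condition, and the helper names (`squirt_plan`, `travel_plan`, `achieve_extinguish`) are hypothetical. It also assumes, purely for brevity, that a single squirt extinguishes the fire.

```python
def squirt_plan(agent, position):
    """Plan for an agent already at the fire: squirt until it is out."""
    while position in agent["fires"]:
        agent["log"].append("squirt")
        # Assumption for the sketch: one squirt puts the fire out.
        agent["fires"].discard(position)


def travel_plan(agent, position):
    """Plan for an agent elsewhere: move, then post the sub-goal again."""
    agent["log"].append(f"move to {position}")
    agent["location"] = position
    achieve_extinguish(agent, position)  # sub-goal, resolved by plan choice


# Each entry pairs a context condition with a plan body.
PLANS = [
    (lambda agent, pos: agent["location"] == pos, squirt_plan),
    (lambda agent, pos: agent["location"] != pos, travel_plan),
]


def achieve_extinguish(agent, position):
    """Run the first plan whose context condition holds; report success."""
    for context, body in PLANS:
        if context(agent, position):
            body(agent, position)
            return True
    return False
```

Because the second plan posts a sub-goal rather than calling the first plan directly, a further variation (for example, a plan for a blocked route) could be added to `PLANS` without changing the existing plans.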



5 Discussion 

We have presented the concepts we have found important in building BDI agent systems 
and the relationships between these concepts. We have found these concepts to be clearer 
and easier to teach and use than the BDI model. The concepts presented: 

- Distinguish between goals and sub-goals, between percepts and events, and between 
events and goals. BDI implementations, by comparison, do not distinguish these and 
merge them all into an “event” type. We feel that this distinction is important since 
goals and events play roles in achieving reactivity and proactiveness and have rather 
different properties: for example, goals persist until they are achieved. 

- Explicitly represent goals. This is vital in order to enable selection between compet- 
ing goals, dealing with conflicting goals, and correctly handling goals which cannot 
be pursued at the time they are created and must be delayed [14]. 

- Highlight goal selection as an important issue. By contrast, BDI systems simply 
assume the existence of a selection function. 

- Emphasise the importance of percepts and actions. 

- Highlight the role of events in creating reactive agents. 

We also introduced reflex actions, generalised the concept of a plan, decomposed the 
concept of intentions (which we have found to be difficult to explain and teach) into 
the simpler notion of a decision, and provided a hierarchical, staged presentation of the 
concepts. 

Our design process (which is still being refined and developed, and which is explained 
more fully in [8]) is focussed on the detailed design and we view it as complementary to 
methodologies such as GAIA [15] and Tropos [7] which focus more on the higher level 
design and analysis and on the requirements aspects of agent systems. 

The choice of concepts was driven by three sources: primarily, our experience working 
on a number of agent programs, teaching students and assisting them to build agent 
programs, and developing and running workshops for academia and industry; secondly, 
a survey of a range of agent systems (see below); and finally, “bottom up” derivation of 
concepts from first principles. 

Although there is some degree of consensus in the deliberative agent research 
community that the BDI model is a reasonable common foundation for intelligent agents, 
there are also known shortcomings of the model [4]. Thus, in order to avoid being 
“BDI-biased” we surveyed a range of agent systems which address the internals of an 
intelligent software agent. For more details see http://www.cs.rmit.edu.au/~winikoff/SAC. 

There is much further work to be done including continuing to apply the concepts 
and design process to various applications. We are also starting to survey students and 
professionals regarding the concepts they regard as natural for developing agent systems. 
Developing a formal semantics for the concepts we have identified, as well as developing 
support tools (design, development, and debugging) are also high priorities. Finally, the 
concepts identified need to be extended (and revised) to support the creation of social 
intelligent agents. 

Acknowledgements. We would like to acknowledge the support of Agent Oriented 
Software Pty. Ltd. and of the ARC (under grant C00106934). 
Abstract. In recent years, there have been increased efforts towards defining 
rigorous operational semantics for a range of agent programming languages. At 
the same time, there have been increased efforts to develop logical frameworks 
for modelling belief, desire and intention (and related notions) that make closer 
connections to the workings of particular architectures, thus aiming to provide 
some computational interpretation of these abstract models. However, there re- 
mains a substantial gap between the more abstract logical approaches and the 
more computationally oriented operational approaches. In this paper, we develop 
an operational semantics for a simplified language based on PRS that is derived 
directly from a high-level abstract interpreter; thus taking one step towards bridg- 
ing this gap in the case of a simplified agent programming language sufficiently 
expressive to incorporate a simple notion of intention. 



1 Introduction 

In recent years, there have been increased efforts towards defining rigorous operational 
semantics for a range of agent programming languages. In this paper, we focus on lan- 
guages based on the BDI (Belief, Desire, Intention) agent architecture as embodied in 
PRS and its variants, Georgeff and Lansky [8], Georgeff and Ingrand [7]. Such 
languages include AgentSpeak(L), Rao [13], Vivid, Wagner [17], dMARS, d’Inverno et 
al. [5], 3APL, Hindriks et al. [9], and Ψ, a PRS-style programming language defined 
using process algebra, Kinny [10].¹ Operational semantics have also been given for 
Concurrent MetateM, Wooldridge [20], using approaches from distributed computing, 
and for ConGolog, de Giacomo et al. [4], using the situation calculus. 

However, especially for languages based on BDI architectures, there remains a sub- 
stantial gap between operational descriptions and semantic descriptions at the higher 
“intentional” level. This gap means that “cognitive” properties of agents, such as ra- 
tionality and commitment, that are typically modelled at this higher level, are not sys- 
tematically connected to the properties of implemented agents as described at the oper- 
ational level, raising doubts as to the efficacy and accuracy of such higher-level mod- 
elling in relation to practice. In part because operational definitions are designed to be 
self-contained, they are not explicitly related to more high-level descriptions of agent 
behaviour. 

¹ Most of these operational definitions are incomplete, thus leaving some choices of computation 
mechanism up to the particular implementation. 

In this paper, we provide an operational description for a simplified language based 
on PRS that is derived directly from a high-level description of its abstract interpreter. 
This goes part way towards bridging the gap between the operational and intentional 
levels of description - in future work, we intend to investigate closer connections be- 
tween the two levels of description. The organization of the paper is as follows. In 
section 2, we motivate our simplified, reconstructed version of PRS by relating it to 
the descriptions of PRS given in the literature. In section 3, we define more precisely 
the operational semantics of our language using Plotkin-style transition systems to de- 
fine plan execution. We conclude in section 4 with a discussion of Bratman’s theory of 
intention in relation to our reconstructed architecture. 



2 PRS: A Reconstruction 

PRS (Procedural Reasoning System) was initially described in Georgeff and Lansky [8], 
with further elaboration given in Georgeff and Ingrand [7]. Here we give a 
reconstruction of PRS along the lines of the implementation of UM-PRS [11], and will 
speak of a ‘PRS-like’ architecture and ‘PRS-like’ plans to acknowledge both their roots 
and the differences between our reformulation and the original language specification 
and implementation.² 

Basically, we consider a PRS agent program to be a collection of plans, originally 
called Knowledge Areas (KAs). These plans are essentially the same as standard plans 
in the Artificial Intelligence literature (though also allowing conditional and iterative 
actions), in that they have a precondition (a condition under which the plan can be 
executed), a postcondition (a condition that indicates successful execution of the plan 
- note that this is quite different from the expected outcome of executing the plan), 
and a body (a collection of actions which when successfully executed will normally 
achieve the postcondition). The body of a plan is also very similar to a standard 
computer program, except that there can be special actions of the form achieve $\gamma$, 
meaning that the system should achieve the goal $\gamma$ in whichever way is convenient: 
these are the analogues of procedure calls. 

To enable decisions about plan execution to be made at runtime, PRS-like plans 
extend standard planning formalisms in having a context (a condition that must be true 
when each action in the plan is initiated),³ a trigger (a condition that, in conjunction 
with the precondition, indicates when the interpreter should consider the plan for 
execution), a termination condition (a condition indicating when the plan should be 
dropped), and a priority (a natural number indicating how important the plan’s goal is 
to achieve). The context is important in dynamic settings: when there are a number of 
ways of achieving a particular goal, the context helps the interpreter to find the “best” 
way of achieving the goal in the current environment, which, due to unforeseen changes 
in the world, cannot always be predicted in advance. The priority of each plan enables 
the system to determine which plan to pursue given a choice of potential plans (usually 
a plan with the highest priority is chosen for execution). 

² Thanks to David Kinny for clarifying differences between the different versions of PRS and 
other PRS-like languages/systems. 

³ Contexts are similar to maintenance conditions in PRS, but differ in that maintenance 
conditions must be true throughout the execution of a plan. 

The original definition of PRS allowed for meta-KAs, or plans that are used to 
determine which plans to invoke and execute.⁴ But in practice, meta-plans have not 
been widely used, and this is partly because they are not available in implementations 
of PRS such as UM-PRS, Lee et al. [11]. Instead, we assume that each plan has a state- 
dependent utility used in conjunction with its (state-independent) priority to further 
discriminate between potential plans at runtime, and a state-dependent cost used to 
indicate the loss involved in temporarily suspending a current plan in favour of a newly 
triggered plan. More precisely (and this should be clearer with the operational definition 
given below), we assume that in any given state, the interpreter considers as options all 
plan instances whose precondition and context hold in the current belief state and whose 
trigger condition is satisfied in virtue of the most recent observations of external events. 
Further, we assume that the agent chooses to act, in any given state, upon the plan with 
the highest priority that has the highest value, where, for a current plan, the value is 
simply the plan’s utility in that state, and, for a newly triggered plan, the value is the 
plan’s utility minus the cost of temporarily suspending the highest valued viable current 
plan, again relative to that state (if this does not yield a unique selection, one such plan 
is chosen at random). Thus priorities, utilities and costs together determine a simple 
“commitment strategy” for the agent. 
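
The selection rule just described can be illustrated with a short sketch. It is a simplification under stated assumptions: the `Option` record and the `deliberate` function are hypothetical names, and the state-dependent suspension cost is reduced to a single number supplied by the caller rather than computed from the highest valued viable current plan.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """Hypothetical record for one candidate plan in the current state."""
    name: str
    priority: int
    utility: float
    newly_triggered: bool = False


def value(option, suspension_cost):
    """Utility for a current plan; utility minus cost for a new plan."""
    if option.newly_triggered:
        return option.utility - suspension_cost
    return option.utility


def deliberate(options, suspension_cost=0.0, rng=random):
    """Pick a highest-value option among those of highest priority."""
    if not options:
        return None
    top_priority = max(o.priority for o in options)
    top = [o for o in options if o.priority == top_priority]
    best = max(value(o, suspension_cost) for o in top)
    candidates = [o for o in top if value(o, suspension_cost) == best]
    return rng.choice(candidates)  # ties are broken at random
```

With a suspension cost of 2, a newly triggered plan of utility 6 loses to a current plan of utility 5 at the same priority, which is the sense in which the cost gives the agent some commitment to what it is already doing.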

The agent’s computation cycle can be conveniently described with reference to 
the simplified interpreter for BDI agents shown in Figure 1, presented by Rao and 
Georgeff [15] (note that here there is also no mention of meta-plans). In this abstract 
interpreter, the system state consists of sets of beliefs B, goals G and intentions I. Each 
element of I corresponds to a partially executed concrete hierarchical plan, towards 
which the agent has made some prior commitment, that is either currently active or has 
been suspended due to an alternative goal being pursued: for consistency, we will call 
such items (concrete) plans rather than intentions, reserving this term for the mental 
analogues of individual actions in such plans. Each cycle of the interpreter runs as 
follows. The process begins with a collection of external events stored in a queue, any or 
all of which may trigger pre-existing abstract plans: along with the currently active and 
suspended concrete plans that are feasible in the current state and the refinements of 
such plans (defined below), these constitute the options available to the agent. Then, 
“deliberation” determines which such plan is chosen for execution, the set I is updated 
to reflect this choice, and the agent executes the next action in the chosen plan. After 
obtaining new external events, the set of current intentions is further updated, first by 
removing those that have successfully completed (whose postcondition holds), then by 
dropping those which are impossible to complete (whose termination condition holds). 

The above description of the interpreter leaves many computational issues unre- 
solved. Hence we make the following additional assumptions concerning PRS-like pro- 
grams, many of which hold for practical implementations of PRS. 

- the belief set B is updated as new events are added to the event queue (so B repre- 
sents the agent’s beliefs about the current world state); 

⁴ In the original descriptions of PRS, KAs do not have priorities, so meta-KAs were necessary 
for fulfilling this purpose. 

Abstract BDI Interpreter: 
initialize-state(); 
do 
    options := option-generator(event-queue, B, G, I); 
    selected-options := deliberate(options, B, G, I); 
    update-intentions(selected-options, I); 
    execute(I); 
    get-new-external-events(); 
    drop-successful-attitudes(B, G, I); 
    drop-impossible-attitudes(B, G, I) 
until quit 

Fig. 1. Abstract BDI Interpreter 



- the belief set B is consistent (i.e. it is assumed that inconsistencies in the set of 
events observed in any one cycle are resolved when updating beliefs); 

- the event queue is cleared after options are generated (this ensures the agent’s in- 
ternal processing does not lag behind its observations of the world); 

- the generated options consist of all concrete instantiations of newly triggered plans, 
all plans in the set I whose next action is feasible (its precondition and the context 
of the subplan in which it occurs are consequences of B), and all refinements of the 
current plans with respect to the current belief set B (the definition of a refinement 
is given below); 

- each element of the set I is a concrete hierarchical plan (defined below); 

- the set of goals G consists of the postconditions of the concrete plans in the set I; 

- for each plan P in the set I, if P' is a maximal subplan of P with respect to the property 
of being believed successful (its postcondition is a consequence of B), or impossible 
(its termination condition is a consequence of B), the updated set of intentions, 
instead of containing P, contains P with P' and all its subplans removed; 

- the updated set of intentions I includes all plans from the previous cycle that have 
not been modified (i.e. suspended plans persist unless believed to be successful or 
impossible). 
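
One cycle of the interpreter, under the assumptions listed above, might be sketched as follows. This is an illustration only: every function passed in (`generate_options`, `deliberate`, `execute`, `observe`, `succeeded`, `impossible`) is a hypothetical stand-in for the corresponding step, goals are left implicit (they are the postconditions of the plans in I), whole plans rather than maximal subplans are dropped, and belief update is reduced to set union with no consistency checking.

```python
def run_cycle(beliefs, intentions, events, *, generate_options, deliberate,
              execute, observe, succeeded, impossible):
    """One pass through the abstract interpreter loop (simplified)."""
    # Options are generated from the queued events and the current state;
    # the old queue is not carried forward, i.e. it is cleared here.
    options = generate_options(events, beliefs, intentions)
    chosen = deliberate(options, beliefs, intentions)
    if chosen is not None:
        intentions = intentions | {chosen}  # update-intentions
        execute(chosen)                     # attempt the next action
    # New observations form the next queue and update the beliefs.
    new_events = list(observe())
    beliefs = beliefs | set(new_events)     # simplification: no revision
    # Drop plans believed successful or impossible; the rest persist.
    intentions = {plan for plan in intentions
                  if not succeeded(plan, beliefs)
                  and not impossible(plan, beliefs)}
    return beliefs, intentions, new_events
```

Passing the steps in as functions keeps the loop itself fixed while the policy for each step is varied, which mirrors the way the abstract interpreter leaves these choices open.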

To define the auxiliary concepts used in the above description, we first assume a 
finite predicate language $\mathcal{L}$ incorporating the usual connectives $\wedge$, $\vee$, and $\Rightarrow$ but no 
quantifiers, i.e. each atomic formula of $\mathcal{L}$ is of the form $r(t_1, \ldots, t_n)$, where $r$ is a 
relation name and each $t_i$ is a term that may contain function symbols, constant symbols 
and variables. The agent’s belief set $B$ at any time is assumed to be a consistent set of 
ground literals of $\mathcal{L}$, and it is assumed that for any ground formula $\alpha$ of $\mathcal{L}$, the issue of 
whether $\alpha$ is a consequence of $B$ (denoted $B \vdash \alpha$) is decidable (and that such $\alpha$ can be 
generated from non-ground formulae of $\mathcal{L}$). Goals and actions are also represented by 
formulae of $\mathcal{L}$ (possibly non-ground in the case of abstract plans). 

We next assume that the agent has a plan library containing a set of abstract plans. 
The conditions associated with each plan are formulae of L, and each plan body is a 
program, defined below. 

Definition 1. The language of programs is defined as follows. First, the atomic action 
formulae of $\mathcal{L}$, including the special formulae achieve $\gamma$ for formulae $\gamma$ of $\mathcal{L}$, plus a 
symbol $\Lambda$ denoting the “null” action, are (atomic) programs. Second, if $\pi$ and $\chi$ are 
programs and $\alpha$ is a formula of $\mathcal{L}$, the following are all programs (and no other 
expressions are programs). 

$\pi; \chi$ (sequential) 

if $\alpha$ then $\pi$ else $\chi$ (conditional) 

while $\alpha$ do $\pi$ (iterative) 

The precondition of the sequence $\pi; \chi$ is that of $\pi$; its postcondition that of $\chi$. The 
precondition and postcondition of $\Lambda$ and of any conditional or iterative program are 
assumed to be true. The precondition of achieve $\gamma$ is false and its postcondition is $\gamma$. 

Definition 2. A (hierarchical) plan is a nonempty sequence of programs $[\pi_1, \ldots, \pi_n]$, 
each with an associated precondition, postcondition and context, whose initial element 
is achieve $\gamma$ for some goal formula $\gamma$ of $\mathcal{L}$ (with precondition true, postcondition $\gamma$ and 
context true). 

Definition 3. The active program of a plan $[\pi_1, \ldots, \pi_n]$ is its final element $\pi_n$. 

Definition 4. A plan $P'$ is a subplan of a plan $P = [\pi_1, \ldots, \pi_n]$ if $P'$ is a nonempty 
suffix $[\pi_i, \ldots, \pi_n]$ of the sequence $P$. 

Definition 5. A plan is concrete if all its contained programs and associated conditions 
are ground formulae; otherwise the plan is abstract. 

Definition 6. A plan $P' = [\pi_1, \ldots, \pi_n, \pi]$ is a refinement of a concrete hierarchical 
plan $P = [\pi_1, \ldots, \pi_n]$ with respect to the belief set $B$ under the following conditions. 

- if $\pi_n$ is achieve $\gamma; \chi$, $\pi$ is a program corresponding to the body of an (instantiated) 
plan from the plan library whose (instantiated) postcondition implies $\gamma$, whose 
(instantiated) precondition and context are consequences of $B$ and whose (instantiated) 
trigger is implied by the beliefs corresponding to the most recently observed 
events (the program achieve $\gamma$ is assumed to be equivalent to achieve $\gamma; \Lambda$); 

- if the context of $\pi_n$ is not a consequence of $B$, $\pi$ is a program corresponding to 
the body of an (instantiated) plan from the plan library whose (instantiated) 
postcondition implies the context of $\pi_n$, whose (instantiated) precondition and context 
are consequences of $B$, and whose (instantiated) trigger is implied by the beliefs 
corresponding to the most recently observed events (in this case, the plan acts as a 
recovery plan for $\pi_n$ by restoring its context). 

Finally, the PRS-like agent’s “commitment strategy” is embodied in the following 
assumptions. 

- In any state, the utility of any refinement of a plan is at most that of the original 
plan; 

- Deliberation chooses for execution a plan with the highest value amongst those 
highest priority options generated. 

Here, the state refers to the mental state of the agent rather than the state of the world, 
i.e. all values, utilities and costs are subjective. 

That is, the PRS-like agent is assumed to have some commitment to its intentions in 
the sense that once a plan is chosen for execution by deliberation, the agent continues to 
attempt to fulfil the corresponding intention, through performing the actions contained 
in the plan body and achieving its various subgoals and attempting to recover from 
failures, unless an alternative option of higher priority (or of equal priority and higher 
value) becomes available, in which case, the plan is (or may be) suspended, or the plan 
is believed impossible to complete, in which case it is abandoned. 

3 Operational Semantics 

The previous section provides a fairly complete intuitive operational definition of the 
execution of PRS-like agents, but a more formal definition is needed if such execution 
strategies are to be shown to correspond to a logical modelling. In this section, we give 
such a formal operational semantics for plan execution, using the structured operational 
semantics style of definition due to Plotkin [12], and following the approach of 
Wagner [17]. This requires the definition of the PRS-like agent’s internal states, then the 
definition of the transitions between states corresponding to each step in the abstract 
BDI interpreter described above. 

Definition 7. A state is a pair $(B, I)$ where $B$ is a consistent set of literals of $\mathcal{L}$ and $I$ is 
a set of concrete hierarchical plans. 

The agent’s state, together with the event queue, contains all and only the infor- 
mation the agent uses in planning and acting. Thus the PRS-like agent does not have 
complex beliefs, for example, introspective beliefs about its beliefs or intentions, beliefs 
about the possible effects of its actions in various states, or beliefs about the past or the 
future. This has the consequence that all conditions associated with plans (such as pre- 
conditions and context conditions) can refer only to the current state of the environment, 
not to other aspects of the agent’s state such as its current intentions. One motivation for 
this is speed of computation: option generation is meant to be a simple process based 
on the newly observed events occurring in the environment (as recorded in the event 
queue). It is the role of the deliberation step to make a “decision” concerning which 
option to pursue, or whether to continue pursuing an existing plan. 

Let us now define a transition function from states to states that formalizes the 
operational semantics of one cycle of the PRS-like agent interpreter. We assume that 
the input belief set to this function is the belief set computed on the previous cycle, 
after new observations have been used to modify the old belief set, and after plans 
considered successful or impossible have been dropped. For the very first cycle, we can 
assume the agent has no plans and some arbitrary set of beliefs. 

The general form of the transition function is the composition of a series of func- 
tions, each corresponding to one step in the abstract interpreter as described by Rao and 
Georgeff [15]. That is, the operation of the interpreter is characterized by a triggering 
function $\tau$ (that returns a set of plans and leaves the state unchanged), a deliberation 
function $\delta$ (that returns a plan $\pi$ and leaves the state unchanged), an update function 
for intentions prior to action $\iota$, a plan execution function $\epsilon$, a belief update function $\beta$ 
(which also accepts a consistent set of literals representing events), and a plan update 
function $\upsilon$, all of which return a modified state. 

We now define each of the component functions in turn. The most complicated 
function is the plan execution function, being defined recursively on the structure of 
programs. In the following, let $\Xi$ be the set of literals in the logical language $\mathcal{L}$ (i.e. the 
atomic formulae of $\mathcal{L}$ and the negations of such formulae), let $\Phi$ be the class of concrete 
hierarchical plans, let $\Sigma$ be the class of states (i.e. the set of pairs $(B, I)$ where $B$ is a 
consistent set of literals of $\mathcal{L}$ and $I$ is a set of concrete hierarchical plans), and for a 
given state $\sigma$, let $B(\sigma)$ and $I(\sigma)$ denote the belief set and intention set in $\sigma$, respectively. 

3.1 Triggering, Deliberation, and Intention Update 

The interpreter is assumed to have access to an event “queue” containing a set of ground 
literals of $\mathcal{L}$ corresponding to the events observed on the previous computation cycle, 
which collectively trigger various concrete plans. For each concrete plan $\pi$ instantiating a 
plan in the plan library (the plan library is assumed fixed), let $trigger(\pi)$, $pre(\pi)$ and 
$context(\pi)$ denote the trigger, precondition, and context of $\pi$, respectively. 

Definition 8. The triggering function $\tau$ is the function $\tau : 2^{\Xi} \times \Sigma \to 2^{\Phi}$ defined as 
follows. 

$$\tau(e, \sigma) = \{\phi \in \Phi : e \vdash trigger(\phi) \text{ and } B(\sigma) \vdash pre(\phi) \wedge context(\phi)\}$$ 
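
A minimal sketch of the triggering function, assuming that every condition is a conjunction of ground literals so that the consequence relation reduces to a subset test. The `PlanInstance` record, the string encoding of literals and the function names are assumptions made for the illustration; a fuller implementation would need a real entailment check and would instantiate abstract plans from the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanInstance:
    """Hypothetical concrete plan: conditions are sets of ground literals."""
    name: str
    trigger: frozenset
    pre: frozenset
    context: frozenset


def entails(literals, condition):
    """Assumed consequence relation: a conjunction holds if all its
    literals are present in the given set."""
    return condition <= literals


def triggered(events, beliefs, library):
    """Plans whose trigger follows from the observed events and whose
    precondition and context follow from the current beliefs."""
    return {plan for plan in library
            if entails(events, plan.trigger)
            and entails(beliefs, plan.pre | plan.context)}
```

Note that the trigger is tested against the events only, while the precondition and context are tested against the belief set, matching the two separate consequence checks in the definition.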

In the “deliberation” step, a plan is selected for execution from amongst the newly 
triggered plans, the current plans eligible for execution, and the possible refinements 
of the current plans. We define a deliberation function $\delta$ whose inputs are the newly 
triggered plans $\Psi$ and the initial state $\sigma$. We use auxiliary functions $\tau'$ and $\tau''$ that 
return, respectively, the current plans eligible for execution and the refinements of the 
current plans. Assume that $\rho(B, \phi)$ returns the set of refinements of a hierarchical plan 
$\phi$ with respect to a belief set $B$, as defined above. 

$$\tau'(\sigma) = \{\phi \in I(\sigma) : B(\sigma) \vdash pre(\phi) \wedge context(\phi)\}, \qquad \tau''(\sigma) = \bigcup_{\phi \in I(\sigma)} \rho(B(\sigma), \phi)$$ 

Here it is understood that the precondition of a plan $\phi$ is that of its active program, 
and the context that of the instantiated plan in the plan library from which the active 
program was derived. 

For any given concrete hierarchical plan $\phi$, let the priority of $\phi$ be denoted $p(\phi)$ 
(that of the plan from which its active program was derived) and let the value of $\phi$ 
in state $\sigma$ be denoted $v(\phi, \sigma)$ (i.e. subtracting a penalty cost from the utility of each 
newly triggered plan). The deliberation function returns a randomly chosen plan that 
has maximal value amongst those with maximal priority amongst the input plans (or 
the empty plan $[\Lambda]$ if no such plan exists). Let the choice function be denoted ‘ran’. 

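Definition 9. The deliberation function $\delta$ is the function $\delta : 2^{\Phi} \times \Sigma \to \Phi$ defined as 
follows. 

$$\delta(\Psi, \sigma) = \begin{cases} ran(\delta'(\Psi, \sigma)) & \text{if } \delta'(\Psi, \sigma) \neq \emptyset \\ [\Lambda] & \text{otherwise} \end{cases}$$ 

where 

$$\delta'(\Psi, \sigma) = \arg\max_{v(\psi, \sigma)} \arg\max_{p(\psi)} \{\psi \in \Psi \cup \tau'(\sigma) \cup \tau''(\sigma)\}$$ 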

Finally, the intention update function either adds the selected newly triggered plan 
to the set of intentions or else replaces a plan by one of its refinements (depending on 
which plan is selected by deliberation), and returns the plan selected and the new state. 

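Definition 10. The intention update function $\iota$ is the function $\iota : \Phi \times \Sigma \to \Phi \times \Sigma$ 
defined as follows. 

$$\iota(\phi, \sigma) = (\phi, (B(\sigma), \iota'(\phi, \sigma)))$$ 

where 

$$\iota'(\phi, \sigma) = \begin{cases} I(\sigma) - \{\psi\} \cup \{\phi\} & \text{when } \phi \in \rho(B(\sigma), \psi) \text{ for some } \psi \in I(\sigma) \\ I(\sigma) \cup \{\phi\} & \text{otherwise} \end{cases}$$ 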

3.2 Plan Execution 

The operational semantics of plan execution is defined by means of a transition 
function $\epsilon$ that specifies the effect of executing one atomic step of the agent’s plan in a 
given state, returning the remainder of the plan to be executed and the resulting state 
(a continuation). The auxiliary function $\epsilon'$ defines the execution of a single step in a 
program. 

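Definition 11. The plan execution transition function $\epsilon$ is the function $\epsilon : \Phi \times \Sigma \to \Sigma$ 
defined as follows. 

$$\epsilon(\phi, \sigma) = (B(\sigma), I(\sigma) - \{\phi\} \cup \{\phi'\})$$ 

where $\pi$ is the active program of $\phi$, $\phi'$ is the same as $\phi$ except that $\epsilon'(\pi, \sigma)$ replaces $\pi$ as 
the active program, and $\epsilon'$ is an auxiliary function $\epsilon' : \Pi \times \Sigma \to \Pi$ (with $\Pi$ the class of 
programs) defined as follows. 

$$\epsilon'(\pi, \sigma) = \pi \quad \text{when } \pi \text{ is an atomic action}$$ 

$$\epsilon'(\pi; \chi, \sigma) = \pi'; \chi \quad \text{where } \epsilon'(\pi, \sigma) = \pi'$$ 

$$\epsilon'(\text{if } \alpha \text{ then } \pi \text{ else } \chi, \sigma) = \begin{cases} \pi & \text{if } B(\sigma) \vdash \alpha \\ \chi & \text{otherwise} \end{cases}$$ 

$$\epsilon'(\text{while } \alpha \text{ do } \pi, \sigma) = \begin{cases} \pi; \text{while } \alpha \text{ do } \pi & \text{if } B(\sigma) \vdash \alpha \\ \Lambda & \text{otherwise} \end{cases}$$ 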
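
The single-step function can be sketched as a recursive function over a small program representation. The tuple encoding of programs, the names `step` and `NULL`, and the use of set membership for the consequence test are all assumptions of this example; returning the null program when a loop condition fails is our reading of the (partly illegible) last rule of the definition.

```python
NULL = ("null",)  # stands for the null action


def step(program, beliefs):
    """Return the continuation after one execution step of a program.

    Programs are tuples: ("act", name), ("seq", p, q),
    ("if", cond, p, q), ("while", cond, p), or NULL.
    A condition is a single literal, tested by membership in beliefs.
    """
    kind = program[0]
    if kind in ("act", "null"):
        # Atomic: the continuation is the program itself, because the
        # step only attempts the action.
        return program
    if kind == "seq":
        _, first, rest = program
        return ("seq", step(first, beliefs), rest)
    if kind == "if":
        _, cond, then_branch, else_branch = program
        return then_branch if cond in beliefs else else_branch
    if kind == "while":
        _, cond, body = program
        return ("seq", body, program) if cond in beliefs else NULL
    raise ValueError(f"unknown program form: {kind!r}")
```

A conditional step only selects a branch and performs none of its actions, so if the beliefs change before the next cycle the agent continues with a branch chosen in the old state.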

One of the peculiarities of PRS-like plan “execution”, in contrast to the execution 
of standard computer programs, is that execution does not necessarily advance the state 
of the computation. This can be seen in the rule for atomic actions, which returns the 
continuation $\pi$, i.e. the same as the initial plan. This is because for an agent embedded 
in the world, execution counts as an attempt to perform the corresponding action, and it 
is only through successful execution of a plan that an agent achieves its goals (the com- 
putation will be advanced if and when an observation confirms successful execution). 
A second feature of the PRS-like interpreter that is a consequence of this rule concerns 
the agent’s actions on the failed execution of an action: no matter how many times an 
action fails, it is simply retried on the following cycle (assuming no alternative of higher
priority arises). Thus it is possible for the PRS-like agent to be stuck in an infinite loop 
of continually trying and retrying the same failed action. 

The rules for conditional and iterative statements also embody an assumption about 
PRS-like agents that may lead to undesirable consequences. The rules effectively state 
that execution of a conditional or iterative statement amounts to determining whether 
the condition $\alpha$ in the statement is a consequence of the belief set $B$ or not (so for the
conditional statement, if $\alpha$ is believed, the then branch is selected, otherwise the else
branch is selected). However, no action of the chosen branch is taken until a subsequent
cycle. One reason for this is that the selected action might be a special action of the form
achieve $\gamma$, in which case its execution requires a further triggering/deliberation cycle.
The problem this raises is that between the time the branch is selected and the time the 
first action in the branch is executed, the belief set may have changed (and the agent 
might have selected the other branch had the test been done in the new state). Thus in 
this case, the agent’s processing lags behind its observations of the environment. 

3.3 Belief and Plan Update 

Finally, the functions for belief and plan update are specified. Belief revision involves
the serious complication that approaches to belief revision in the literature, e.g.
Gärdenfors [6], lead to revision functions that are computationally expensive and, moreover,
indeterminate (not specified uniquely in terms of just the agent’s beliefs). To address
both issues, PRS-like systems typically restrict the language of events and beliefs to
make belief revision both determinate and simple. One such simplification, e.g.
Wagner [17], is to insist that the observations on each cycle correspond to a consistent set of
literals $e$. Then revision can be defined, cf. Gärdenfors [6], as the function that removes
the complement $\bar{l}$ of each literal $l$ in $e$ from $B$ (if it is contained in $B$) and then adds each
(now consistent) literal $l$ of $e$ to $B$.

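Definition 12. The belief update function $\beta$ is the function
$\beta : 2^{L} \times \Sigma \to \Sigma$, with $L$ the set of literals, defined as follows.

$$\beta(e, \sigma) = (B(\sigma) - \bar{e} \cup e, I(\sigma))$$

where

$$\bar{e} = \{\bar{l} : l \in e\}$$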
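
As an illustration of this literal-based update, the sketch below removes the complement of every observed literal and then adds the observations. The string encoding of literals (a leading `~` for negation) and the function names `complement` and `belief_update` are assumptions made for the example only.

```python
from typing import Set


def complement(literal: str) -> str:
    """Complement of a literal; '~p' is the assumed encoding of 'not p'."""
    return literal[1:] if literal.startswith("~") else "~" + literal


def belief_update(observed: Set[str], beliefs: Set[str]) -> Set[str]:
    """Remove the complement of each observed literal, then add the literal."""
    complements = {complement(lit) for lit in observed}
    return (set(beliefs) - complements) | set(observed)
```

For example, observing `{"~door_open"}` with beliefs `{"door_open", "at_home"}` yields `{"~door_open", "at_home"}`. The result is consistent only if the observed set itself is consistent, which is the restriction imposed in the text.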

The plan update function is more straightforward. This function is mainly concerned 
with “housekeeping” which consists of removing those actions whose programs have 
finished execution, or those plans that have achieved their goals or which are believed to 
be impossible to complete. Actions that have completed successfully are those whose 
postconditions are a consequence of the belief set (it is assumed, of course, that the 
beliefs are accurate); in addition, if the active program of a plan is $\Lambda$, the empty
program, then that program requires no further execution. Subplans that are believed to be
impossible are those whose termination conditions are a consequence of the belief set. 

For a concrete hierarchical plan $\phi$, let $\mathit{successful}(\phi)$ be the set of subplans $\psi$ of $\phi$
for which $B(\sigma) \vdash \mathit{post}(\psi)$, and let $\mathit{impossible}(\phi)$ be the set of subplans $\psi$ of $\phi$ for
which $B(\sigma) \vdash \mathit{termination}(\psi)$ (as above, for any plan, the conditions are those of
the instantiated plan in the plan library from which the active program of the plan was
derived).

Definition 13. The plan update function $\upsilon$ is the function $\upsilon : \Sigma \to \Sigma$ defined as
follows.

$$\upsilon(\sigma) = (B(\sigma), \omega(\{\eta(\phi - \upsilon'(\phi)) : \phi \in I(\sigma)\}))$$

where, for each plan $\phi$, $\upsilon'(\phi)$ is the largest subplan contained in $\mathit{successful}(\phi) \cup
\mathit{impossible}(\phi)$ if this exists, or else is the empty plan, $\phi - \psi$ is $\phi$ with the subplan
$\psi$ removed, and for any plan $\phi$, $\eta(\phi)$ returns $\phi$ with the active program $\pi; \chi$ replaced
by $\chi$ if $B(\sigma) \vdash \mathit{post}(\pi)$ for an atomic action $\pi$, and otherwise returns $\phi$ (again, the
program achieve $\gamma$ is assumed to be equivalent to achieve $\gamma; \Lambda$). Finally, the function $\omega$
removes, from a given set of plans, any plan that consists of a singleton sequence whose
only element is the program $\Lambda$.

Note that nowhere is it described when and how the PRS-like agent abandons its 
goals. This is because, whereas the termination condition is associated with a plan, 
there is no corresponding termination condition associated with a goal, so there is no 
belief the agent can use to determine that a goal is infeasible. The only way a goal can 
be abandoned is when it occurs as a subgoal of a plan that is abandoned. 



4 Discussion 

The exercise of developing an operational semantics for an agent programming lan- 
guage of the PRS variety raises fundamental questions about such languages and their 
architectures qua rational agent architectures. This is because intentional notions such 
as knowledge, belief, desire and intention are given specific meanings with reference 
to an underlying architecture in terms of transitions on internal states. The question of 
interest then is whether these so-called BDI agents really have these mental attributes, 
as seems to be commonly presumed. This question can only be answered with respect 
to a specific theory of beliefs, desires and intentions such as that provided by Bratman [1].
Recall that Bratman characterizes intention with reference to three functional
roles, e.g. Bratman [1, p. 141]: (i) intentions pose problems for deliberation, i.e. how to 
fulfil them, (ii) existing intentions constrain the adoption of further intentions, and (iii) 
intentions control conduct: an agent endeavours to fulfil its intentions. To some degree, 
the plan structures used by PRS-like agents do possess these roles. For property (i), the 
plans adopted by PRS-like agents may include special actions of the form achieve $\gamma$,
which can be construed as posing problems to be solved by the agent. For property (ii), 
the structure of a single plan constrains the adoption of further intentions correspond- 
ing to refinements of the plan. For property (iii), the computation cycle of the PRS-like 
interpreter ensures some persistence of intention, normally leading to attempts by the 
agent to fulfil its intentions.

However, Bratman also emphasizes that these are the roles of intention in reasoning, 
i.e. mental processing of a certain degree of complexity is involved in reasoning with 
intentions, and it is this which is absent from the PRS-like agent architecture. In particu- 
lar, Bratman [1, p. 31] emphasizes that intentions play a role in coordinating the overall
activities of the agent, such that the agent’s overall plan should be strongly consistent 
(believed feasible). But to establish this property of a single course of actions, the agent 
must reconcile the competing requirements of all its adopted intentions in order to en- 
sure that they can all be fulfilled in a coordinated manner - this aspect of determining 
the overall coherence of an agent’s plans is entirely absent from the PRS-like system, 
whilst present (again, to some degree) in a system such as IRMA, Bratman, Israel and 
Pollack [2]. Let us take a closer look at the three functional roles of intention with this
in mind. For property (i), the case of achieve $\gamma$ subgoals, the PRS-like agent solves such
subproblems only when it is time to execute the achieve action, which entails triggering
relevant plans and selecting one that achieves $\gamma$. Thus such subgoals do not generate
further intentions or constraints on future intentions adopted during the agent’s inter- 
mediate “planning”. On the other hand, PRS-like agents are assumed to be situated in 
an environment where, perhaps, decisions about how best to achieve a subgoal are best 
deferred until execution time; if so, this is not as the world is commonly assumed. For 
property (ii), while the structure of a single plan constrains the adoption of future in- 
tentions related to that plan, the multiple plans of PRS-like agents are never reconciled 
in one overarching plan, and the plans of the PRS-like agent are not necessarily ever 
mutually compatible. For property (iii), the computation cycle embodied in the PRS-
like interpreter ensures some persistence of plans, but also that the agent may attempt
to execute an action incompatible with one of its own plans, leading to the needless 
abandonment of a plan. 

Thus the operational semantics accorded to PRS-like languages and architectures 
makes plain that systems based on PRS have intentions in Bratman’s sense only to an
extremely limited extent. Whether PRS-like systems embody BDI agents then turns 
on whether fulfilling the three roles posited by Bratman as functions of intentions in 
humans is a constitutive requirement of intentions, or merely additional to the role in- 
tentions play in human reasoning. In the latter case, an alternative, weaker, theory of 
intention would be needed to justify the claim that PRS-like agents have intentions. 



5 Conclusion and Further Work 

In this paper, we provided an operational description for a simplified language based on 
PRS that is derived directly from a high-level description of its abstract interpreter. This 
goes part way towards bridging the gap between the operational and intentional levels 
of description. In comparison to other operational descriptions of PRS-like systems, 
our language is more expressive in allowing conditional and iterative constructs, and is 
more explicit about action selection, through the use of utility functions to formalize 
this aspect of the computation cycle. 

In future work, we intend to investigate closer connections between the operational 
and intentional levels of description. The main problem to be addressed is that there 
is no clear way of mapping abstract logical models onto the computational states of 
an implemented agent. The work of Rao and Georgeff [14] on modelling intention, of 
Singh [16] on modelling strategies, and of Cavedon and Rao [3] on modelling plans 
provides a starting point, and earlier work showed how to define such mappings in 
simplified cases, e.g. Wooldridge [19], Wobcke [18]. 
Abstract. We propose a model for the recognition of unconstrained
digits that may touch neighboring digits or be damaged by noise
such as lines. Recognizing such digits seems rather paradoxical:
it requires segmenting them into understandable units, but proper
segmentation needs a priori knowledge of the units, and this implies
recognition capability. To break this loop of interdependence, we
combine two schemes, hypothesis testing and data reconstruction,
motivated by the human information system. A hypothesis is set up on
the basis of the information obtained from the basic segmentation,
the information is reconstructed using knowledge of the guessed digit,
and the hypothesis is then tested for validity. Since our model tries
to construct a guessed digit from the input image, it can succeed in a
variety of situations: when a digit contains strokes that do not belong
to it, when neighboring digits touch each other, and when occluding
objects such as lines are present. The recognition results of this model
for 100 handwritten numeral strings from the NIST database and for some
artificial digits damaged by lines demonstrate its potential.



1 Introduction 

Compared with recognizing an isolated digit, recognizing a digit among noise or
other digits is a very difficult problem. Moreover, if the digit is touched or
overlapped by noise or other digits, the task even seems somewhat paradoxical: it
requires separating what is to be recognized from everything else, but proper
separation requires a priori knowledge of the patterns that form meaningful units,
and this implies recognition capability. Lecolinet and Crettez [1] pointed out that
segmenting words before recognizing them can be paradoxical, because without any
recognizable symbols, one would have to resort to an arbitrary segmentation process,
which in turn may result in numerous errors. Recognition-based segmentation has
the same drawback: recognizing a word that contains illegal characters is even
more difficult because the proper segmentation is not known for the non-recognized
parts of the word.

The root cause is that segmentation and recognition are interdependent: each
requires the results of the other to complete its work, as shown in
Fig. 1(a) (inner loop).

The rest of this paper is organized as follows: Section 2 reviews past work,
Section 3 gives an overview of the proposed model, Sections 4-6 describe its
modules, Section 7 reports experiments, and Section 8 concludes the paper.



2 Review of Past Work 

Many studies have tackled this dilemma. According to Casey and Lecolinet [2],
there are three main strategies for character segmentation, and numerous hybrids
of the three.

The three main strategies are the holistic method, the dissection approach,
and recognition-based segmentation. The pure holistic method does not require
segmentation because the system tries to recognize words as a whole. However,
the holistic method is applicable only in restricted cases where a predefined
lexicon is available or the string length is known [3].

The two remaining strategies are the ones mainly used for numeral recognition.
Most recent research tends to combine them so that each compensates for the
inadequacies of the other. Arica and Yarman Vural’s method [4] considers several
candidates for segmentation paths, which are then confirmed by a Hidden Markov
Model (HMM). The source of their errors may be traced to missed segmentation
regions, i.e. to the assumption that each digit can be segmented into at most two
segments. Yu and Yan [5] detect a candidate touching point based on geometrical
information. In cases where the left or right lateral numeral of a single-touching
handwritten string can be recognized, recognition information may be used to
correct the position of the candidate touching point, but this is limited to
single-touching cases. Ha, Zimmermann and Bunke [6] combine segmentation-based and
segmentation-free methods in a cascading manner to obtain the efficiency of the
former and the accuracy of the latter, and to avoid the defects of the latter, such
as a higher computational complexity and a segmentation error that grows with the
number of numerals. In the segmentation-free module, singular points obtained after
thinning are employed as clues for extracting partial strokes, and recognition then
uses a weighted graph whose nodes represent the partial strokes and whose arcs
carry the cost of linking two nodes. Rocha et al. [7,8] and Filatov et al. [9] both
use a graph to describe symbol prototypes; recognition relies on a skeletonization
of the initial image with the help of singular points in [7,8], and on predefined
symbols representing the local graph shape in [9].

The main differences between existing methods and ours are as follows.
Firstly, we use a dynamic tracking method to avoid the defects of skeletonization,
and for line segmentation we adopt the principle of good continuation, which is an
underlying property of perceptual grouping in line segregation, instead of
singular points such as an end point, a T-joint, or a crossing point. Secondly,
recognition is carried out by extracting from the input image what is needed to
build up the assumed digit, instead of by attempting to determine the most
consistent combination of blindly over-segmented subimages. Finally, existing work
on the recognition of characters damaged by lines has an intrinsic limitation,
namely that it cannot be extended to curves or to other shapes such as ink
stains [10], whereas our method is largely independent of the shape of the noise.

3 Overview of Proposed Model 

As mentioned above, segmentation and recognition are interdependent, which makes
the task paradoxical. A hint toward a solution to this dilemma can be taken
from the human information system.

When a stimulus reaches our senses, we first obtain basic features called
primitives, the simple, basic units of perception. Based on these primitives, we
tentatively hypothesize the object that is most likely to be the cause of the
sensory stimulation, and then verify the hypothesis through an attentive analysis
of the primitives, their compounds, the results of partial perception and the
knowledge of the assumed object [11,12]. In humans, not only bottom-up information
such as primitives but also top-down information such as knowledge about an object
plays an important role in recognition, and the bottom-up information is actively
reconstructed for effective recognition. We can even create virtual information in
the absence of any real stimulus, which demonstrates that the human system has the
means to reconstruct the input information [13].

It is believed that these faculties together make it possible to segment and
recognize a given object among other things efficiently. Our model is based on
these observations. As shown in Fig. 1(a) (outer loop), rough information is gained
by the basic segmentation. Since digits are mainly composed of lines, we use as
the basic principle of segmentation the property of good continuation, which is a
fundamental intuitive property of perceptual grouping in line segregation, as
advocated by the Gestalt school of psychology [11]. Some numerals can be guessed
on the basis of the rough information from the preceding segmentation; the
rough information is then reconstructed using the knowledge of the assumed
numerals, and the hypothesis is tested for validity.

We define seven basic elements (BEs) according to the shapes of strokes
that can be obtained by splitting the strokes of numerals by the criterion of
good continuation, and build up a digit prototype (DP) from the BEs and the
information on their interrelationships as a priori knowledge. To obtain the rough
information, line primitives (LPs) are extracted from the given image
after preprocessing. Partial perception is carried out through the similarities
between BEs and LPs. The candidate digits are inferred from the similarities of
LPs to BEs and from the interrelationship information described in the DP. The
rough information from the image is reconstructed to accord with the knowledge of
the assumed digits, the candidates are then tested for validity, and the final
decision is made. The proposed algorithm is summarized in Fig. 1(b).



[Figure 1(b) block labels: Input Data; Prerequisite; A Priori Knowledge; Bottom-Up
Mechanism; Top-Down Mechanism; Preprocessing; Skew Correction; Image Enhancement;
Digit Composition; LP Extraction; Digit Prototype; Measuring Similarity between BEP
and LP; Weighting by LPs’ Relation; Reconstruction & Recognition]

Fig. 1. (a) Segmentation and recognition model, (b) Organization of proposed model



4 Preprocessing 

4.1 Stroke Width Estimation 

In our application lines are the main objects, so stroke width is one of the
important parameters for handling images. However, it can vary locally within a
line, depending on the writing device and pressure. It is therefore reasonable to
use the average stroke width. The number of occurrences of each stroke width is
obtained through horizontal and vertical projection, and the average stroke width
is then estimated from the peak of the resulting histogram of stroke-width
frequencies. If there are many small fragments, this estimate may be wrong, so
noise removal and fragment merging should be carried out before the estimation.
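
The histogram-peak estimate can be sketched in Python as follows. This is a minimal illustration, assuming the image is a list of rows with 1 for foreground and 0 for background, and interpreting the horizontal and vertical projection as counting foreground run lengths along rows and columns; the function names `run_lengths` and `estimate_stroke_width` are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def run_lengths(line: Sequence[int]) -> List[int]:
    """Lengths of consecutive foreground (non-zero) runs in one row or column."""
    runs, count = [], 0
    for pixel in line:
        if pixel:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    if count:
        runs.append(count)
    return runs


def estimate_stroke_width(image: Sequence[Sequence[int]]) -> int:
    """Most frequent run length over all rows and columns (0 if no foreground)."""
    histogram: Counter = Counter()
    for row in image:
        histogram.update(run_lengths(row))
    for column in zip(*image):
        histogram.update(run_lengths(column))
    if not histogram:
        return 0
    # Peak of the histogram; ties are broken in favour of the smaller width.
    return min(histogram, key=lambda width: (-histogram[width], width))
```

A vertical bar two pixels wide and six pixels tall contributes six runs of length 2 (rows) and two runs of length 6 (columns), so the peak is 2. Small fragments add many short runs, which is why the estimate is sensitive to noise.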

4.2 Skew Correction 

Handwritten numeral strings often have slants specific to each writing style.
This makes it difficult to extract invariant features from the strings
and analyze them. Our slant estimation is performed by extracting the contour
pixels from the entire string, as shown in Fig. 2(b), and by calculating their
gradients using a gradient operator such as Sobel. Applying the gradient operator
only to the contour pixels, which give the most information about direction,
helps to save time. The estimated global slant is the average of the angles of all
pixels whose angle lies between 15° and 165° or between 195° and 345° (Fig. 2(c)),
weighted by their length in the vertical direction: the longer the line, the more
accurate the angle [14]. The skewed string is corrected based on the estimated
slant angle $\theta$ by adjusting the x-coordinates of all black pixels with Eq. (1).

$$x_n = x - (y - \mathit{Height}/2) \times \tan(\theta), \qquad y_n = y \qquad (1)$$

where $x_n$ and $y_n$ are the coordinates after the correction of a component at $(x, y)$.
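
Eq. (1) is a horizontal shear about the middle row of the string. The short Python sketch below applies it to a list of pixel coordinates. It assumes that the angle is supplied in degrees as the deviation to be removed (so an angle of 0 leaves the image unchanged); the function name `deslant_points` and its parameters are illustrative only.

```python
import math
from typing import List, Tuple


def deslant_points(
    points: List[Tuple[float, float]], height: float, theta_deg: float
) -> List[Tuple[float, float]]:
    """Shear each (x, y) horizontally about the row y = height / 2."""
    shear = math.tan(math.radians(theta_deg))
    return [(x - (y - height / 2) * shear, y) for x, y in points]
```

Pixels on the middle row keep their x-coordinate, while pixels above and below it move in opposite directions. The resulting x-coordinates are generally non-integer, which is the source of the quantization problem discussed next.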

There are two problems incurred during the skew correction. First, some 
digits may touch their neighbors after the slant correction because of the new 
8-connected neighbor relationship between the boundary pixels of them. Second, 
an error in quantization may result in contour jaggedness. 

We solve the first problem by applying a connected component technique to 
the original string images before the correction. Connected components with dif- 
ferent indices may be treated as distinct entities even though they may actually 
be connected after the slant correction. 

The second problem is overcome using linear interpolation [15].
Linear interpolation assumes that the contribution of a pixel in the
neighborhood varies directly with its distance. As a result of the transformation,
a pixel is given a noninteger address. The intensity of the transformed black pixel
is derived from the four nearest pixels according to their relative distances from
the calculated address of the transformed pixel; that is, two-dimensional
interpolation is performed. If the intensity of the transformed pixel is less than
a threshold, it is set to 0 (white), and otherwise to 1 (black). Fig. 2(a) and (d)
show the original image and the slant-corrected image, respectively.
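
One common way to realize such two-dimensional linear interpolation is to sample the binary image at the noninteger address and then threshold the result, as in the Python sketch below. The threshold value of 0.5, the treatment of pixels outside the image as white, and the function names `bilinear` and `binarize` are assumptions for illustration, not details given in the text.

```python
import math
from typing import Sequence


def bilinear(image: Sequence[Sequence[float]], x: float, y: float) -> float:
    """Intensity at a noninteger address from the four nearest pixels."""
    rows, cols = len(image), len(image[0])
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0

    def pixel(r: int, c: int) -> float:
        # Pixels outside the image are treated as white (0).
        return image[r][c] if 0 <= r < rows and 0 <= c < cols else 0.0

    top = (1 - fx) * pixel(y0, x0) + fx * pixel(y0, x0 + 1)
    bottom = (1 - fx) * pixel(y0 + 1, x0) + fx * pixel(y0 + 1, x0 + 1)
    return (1 - fy) * top + fy * bottom


def binarize(intensity: float, threshold: float = 0.5) -> int:
    """0 (white) below the threshold, otherwise 1 (black)."""
    return 0 if intensity < threshold else 1
```

Weighting by relative distance smooths the staircase effect that plain rounding of the sheared coordinates would produce along stroke contours.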



Fig. 2. Skew handling: (a) an original image, (b) a result of edge extraction, (c) edge
points with the vertical factor, (d) an image after skew correction

4.3 Image Enhancement

We need to link small fragments that have become separated from the main strokes
and to remove noise. The small gaps and fragments of strokes caused by input devices
or imperfect binarization may result in errors when analyzing the structure of
digits, especially in the case of a connected component-based method like ours.
We adopt the selective region growing method [16], which dynamically selects
one of four differently shaped neighborhood operators, based on the properties
of the neighborhood. As in the case of skew correction, when digits in the input
string are very close to each other, region-growing methods usually make
the mistake of linking two regions (digits). This problem can be overcome using
the connected component information: two connected components under consideration
can be joined only if Eq. (2) is satisfied.

$$\begin{aligned}
& C_1.i = C_2.i \\
& C_1.s + C_2.s < \mathit{Average\_Blob\_Size} \\
& (C_1.lx < C_2.lx \text{ and } C_2.rx < C_1.rx) \text{ or } (C_2.lx < C_1.lx \text{ and } C_1.rx < C_2.rx)
\end{aligned} \qquad (2)$$

where $C.i$ is the connected component index, $C.s$ is the size of the component,
$C.lx$ and $C.rx$ are the left and right x-coordinates of the component, respectively,
and Average_Blob_Size is the average number of foreground pixels in the components
whose height is greater than or equal to their width. An example of
image enhancement is shown in Fig. 3.



Fig. 3. Image enhancement: (a) an original image, (b) an enhanced image



4.4 Digit Composition 

At times a connected component may only represent a piece of an individual 
digit. Numerals may be broken into pieces due to poor print or scan quality, 
or their strokes may be written in such a way that they appear detached. This 
frequently happens in the case of digits with long horizontal strokes such as ‘4’ 
and ‘5’. So these components should be merged together in order to construct 
complete digits. Digits are composed using the same criteria as in
Garris’ work [17].

5 Segmentation for Rough Information

5.1 Stroke Representation 

A scanned image of a drawing is initially an ordered array of pixels, each
representing the average intensity over a small region. The ability to build up
from these individual pixels a representation that exploits relationships such as
local proximity and highlights the structure of the underlying components is
important for the extraction of features during interpretation and recognition. In
our model we use a dynamic line tracking method that does not need line fitting.
As shown in Fig. 4, a circularly symmetric Gaussian bead (GB) with a variable
radius is used as the processing unit instead of a pixel, and a graph is built up
from the GBs. The fitness of each foreground pixel in the image is determined
independently from a distribution defined by circularly symmetric Gaussian beads.
The bead that has the highest fitness in the search area is chosen as the new
neighbor. Each node of the graph represents a GB, and an arc between two
nodes stands for their connection and direction. The local direction is that
of the line passing through the centers of the two GBs. A bead whose diameter
exceeds the average stroke width has virtual multiple centers.




Fig. 4. Stroke representation and construction 



5.2 Method 

Ghen and Hsu [18] have proven that good continuation is a feasible and reasonable
criterion for smooth line segregation, and that it is satisfied if the orientation
varies smoothly along the line. However, their hypothetical model is implemented
using pixel-based operations, which makes it difficult to interpret the line being
segregated as a meaningful unit. The ability to interpret is required for proper
segmentation, as described in the previous section. We therefore employ the GB for
stroke representation, as described above.

Two steps are taken to get smooth lines from the graph obtained in the
previous stage. In the first step, straight lines are extracted by gathering the
nodes whose direction differences are less than a threshold that is determined
by the diameters of the current beads. In the second step, smooth curves are
obtained by merging straight lines into larger ones if their directions change
monotonically and smoothly. Fig. 5(a) shows instances of the straight lines
and the smooth curves.
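
The first step, gathering consecutive nodes with similar directions into straight lines, can be sketched as follows. The sketch makes several simplifying assumptions that are not in the text: directions are given in degrees and treated as undirected orientations (modulo 180°), the nodes are already ordered along the stroke, and a single fixed threshold is used instead of one derived from the bead diameters. The names `angle_diff` and `split_straight_runs` are hypothetical.

```python
from typing import List, Sequence


def angle_diff(a: float, b: float) -> float:
    """Smallest difference between two undirected orientations, in degrees."""
    d = abs(a - b) % 180.0
    return min(d, 180.0 - d)


def split_straight_runs(
    directions: Sequence[float], threshold: float
) -> List[List[int]]:
    """Group consecutive node indices whose direction change stays small."""
    runs: List[List[int]] = []
    for index, direction in enumerate(directions):
        previous = directions[runs[-1][-1]] if runs else None
        if previous is not None and angle_diff(previous, direction) < threshold:
            runs[-1].append(index)
        else:
            runs.append([index])
    return runs
```

Because each node is compared only with its predecessor, a slowly bending stroke may stay in one run; the second step in the text, which merges lines with monotonic and smooth direction changes, is where such curves would be handled explicitly.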

In addition to the smooth lines, circles and half circles whose smoothness
is violated in a specific area need to be detected when the prototype
of the assumed digit is composed of them (Fig. 5(b)). Both kinds of LPs, smooth
and not smooth, are used as units in matching between the input data
and the digit prototypes, as described in the following section, but their scopes
of application differ: the LPs with the smoothness property are used for all
digit prototypes, whereas the additional LPs are used exclusively for those
prototypes that contain circles or half circles.




Fig. 5. Extracted line primitives: (a) straight lines, (b) circles and half circles that have 
no smoothness but monotonous changes of angle on the whole 



6 Hypothesis Testing and Data Reconstruction 

There are two main processes for hypothesis testing, top-down and bottom-up,
as shown in Fig. 6. Among all LPs, including the additional half circles and
circles, we search for the LPs that are homomorphic to the basic elements of the
previously defined prototypes (BEPs), using knowledge of the interrelationships of
the BEPs, such as their position, size and connection.



6.1 A Priori Knowledge 

Seven abstract elements are defined as the basic elements of prototype (BEPs).
They comprise three smooth curves and four straight lines: a circle, a left-open
half circle, a right-open half circle, a vertical line, a horizontal line and two
diagonals. Their types are derived from the shapes of lines that can be obtained by
splitting the strokes of undistorted digits by the criterion of smoothness.

Fig. 6. Top-down and bottom-up mechanism

The digit prototype (DP) for each digit is composed of four fields,
namely basic elements, size relationship, connection relationship and position
relationship. Knowledge about which BEPs are required to build up the digit
is stored in the basic elements field. The relative size and relative position
of the BEPs in the DP are stored in the corresponding fields. Connection
information is represented by a table in the connection field.

6.2 Digit Inference 

To find out which DPs correspond to an input image, the similarity of each LP
to each BEP is first calculated individually, and then the similarity of each
combination of LPs to each DP is evaluated. We use not only the LP matching scores
but also the weights of their relationships in evaluating the total score for each
DP. The steps taken are summarized as follows.

(i) Let $S_{ij}$ be the similarity of $LP_i$ to $BEP_j$, defined as follows:

$$S_{ij} = \begin{cases} C_1 \times \mathit{SizeRatio}_{ij} \times \mathit{AngleScope}_i / \mathit{AngleScope}_j & \text{if } BEP_j \text{ is a curve} \\ C_2 \times \mathit{SizeRatio}_{ij} \times \mathit{Average}(90 - |\mathit{AngleScope}_i - \mathit{AngleScope}_j|) & \text{otherwise} \end{cases}$$

where $C_1$ and $C_2$ are constants, $\mathit{AngleScope}_j$ is the angle range of the predefined
$BEP_j$, $\mathit{AngleScope}_i$ is the angle range of $LP_i$ belonging to $\mathit{AngleScope}_j$,
and $\mathit{SizeRatio}_{ij}$ is $LP_i / BEP_j$.

(ii) Make a similarity table by evaluating $S_{ij}$ for all $i$ and $j$, where $i$ ranges
over the extracted LPs and $j$ over the predefined BEPs.

(iii) Let $W_k$ be the weighting factor for $DP_k$, determined in proportion
to the correspondence between the relations among the BEPs of $DP_k$ and those among
the LPs. $DP_k$ is the predefined prototype for digit $k$, $0 \le k \le 9$.

(iv) Search the similarity table and $W_k$ for the LPs whose total scores are maximum.

(v) Select the $DP_k$ having the highest score, taking $W_k$ into account.
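
Steps (ii) to (v) can be illustrated with a small Python sketch that takes an already computed similarity table and picks the best-scoring prototype. Several choices here are assumptions rather than details given above: each required BEP is matched to its best LP independently, the weight is applied as a multiplicative factor on the summed match score, and the prototype contents in the usage example are invented for illustration. The function name `best_prototype` is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def best_prototype(
    similarity: Sequence[Sequence[float]],
    prototypes: Dict[int, List[int]],
    weights: Dict[int, float],
) -> Tuple[int, Dict[int, float]]:
    """Score every digit prototype and return the best digit and all scores.

    similarity[i][j] is the similarity of LP i to BEP j;
    prototypes[k] lists the BEP indices required by the prototype of digit k;
    weights[k] is the relation-based weighting factor for that prototype.
    """
    scores: Dict[int, float] = {}
    for digit, required_beps in prototypes.items():
        # Assumed matching rule: best LP for each required BEP, summed.
        match = sum(max(row[j] for row in similarity) for j in required_beps)
        # Assumed combination rule: weight applied multiplicatively.
        scores[digit] = weights[digit] * match
    best = max(scores, key=lambda digit: scores[digit])
    return best, scores
```

For instance, with two LPs, three BEPs, and two hypothetical prototypes, `best_prototype([[0.9, 0.1, 0.2], [0.2, 0.8, 0.1]], {1: [0], 7: [0, 1]}, {1: 1.0, 7: 0.5})` favours digit 1, because the low relation weight of the second prototype outweighs its larger raw match score. A fuller implementation would also prevent one LP from being assigned to several BEPs.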



6.3 Knowledge Based Reconstruction and Recognition 

For each DP, we select the well-matching LPs, taking into account their
similarities to the BEPs and the weighting values based on the relationships among
them. Once the candidate numerals are determined, the rough information obtained
through the basic segmentation can be reconstructed on the basis of the knowledge
about them. The most reliable LP of a candidate is taken as the starting point, and
the LPs are reconstructed to coincide with the BEPs of its DP as closely as
possible. In our model, segmentation is done naturally during the reconstruction.
Fig. 7 shows examples of segmentation through reconstruction. Finally, we select
the best of the candidates, whose total scores are recalculated on the basis of the
reconstructed LPs in the same way as before.

The total score has four parts: the total sum of match scores (TM), the score
of each LP corresponding to a BEP of the DP (M), the weighting scores between
pairs of LPs (W), and the indices of the LPs used (P). To decide whether the
input blob contains touching digits and needs to be segmented, we first assume
that each DP whose TM is above a threshold and relatively large represents its
own single digit, and then investigate the extent to which the LPs overlap.
Where there is little or no overlap, each DP represents a single digit and
segmentation is done naturally. If the parts of LPs shared by different digits
are too big for an individual digit, the blob is taken to be the one digit
having the highest matching score. If some LPs share a part that is significant
enough to affect the shape of a BEP, heuristic knowledge is applied to
segmentation and recognition. Because it is known which parts of the digits
the shared parts are, segmentation and recognition can be done safely. For
example, as shown in Fig. 7(a), the shared part is understood to be a
horizontal stroke, so there is little risk in splitting it in two.



7 Experimental Results 

The proposed model is tested using artificial digits damaged by lines and the
NIST SD3 database, provided by the American National Institute of Standards
and Technology (NIST) in 1992. We use 10 artificial digits and 100 numeral
strings of length 2 to 10, of which 21 strings have broken strokes, 29 contain
small fragments, 34 exhibit two-digit touching, and 1 has three-digit touching.
In our model one prototype per class is used to recognize numerals. The results
show that 95.5% of all strings are correctly recognized at the zero-reject
level. The details are summarized in Table 1. The causes of errors are
incorrect LP extraction and digit composition failure.
Fig. 7. Reconstruction and recognition: (a) Example of touching digits, (b) Example of
digits damaged by line



Table 1. Recognition results

String length      Number of tested strings   Recognition rate (in %)
Artificial digit   10                         100
2                  24                         92
3                  21                         100
4                  21                         95
5                  17                         94
6                  12                         92
10                 5                          100
Total              110                        95.5



8 Conclusions 

Segmentation and recognition are mutually dependent. Proper segmentation
needs the help of a recognizer, which in turn can give critical help only
if segmentation is done well. We therefore combine two schemes, hypothesis
testing and data reconstruction, motivated by the human information system.
Since our model tries to construct the assumed digit from the input image, it
can be successful in a variety of situations: a digit contains strokes that do
not belong to it, neighboring digits touch each other, or there are occluding
objects such as lines. The experimental results demonstrate the performance
and capacity of our model.
Abstract. Data collecting is necessary for some organizations, such as
nuclear power plants and earthquake bureaus, which have very small
databases. Traditional data collecting obtains the necessary data from
internal and external data-sources and joins all of it together to create
one huge homogeneous database. Because collected data may be untrustworthy,
it can disguise really useful patterns in the data. In this paper, breaking
away from the traditional data collecting mode that treats internal and
external data equally, we argue that the first step in utilizing external
data is to identify quality data in data-sources for given mining tasks. Pre-
and post-analysis techniques are thus advocated for generating quality data.
Keywords: Data preprocessing, data collecting, data mining, quality
data, data sharing.



1 Introduction 

In knowledge discovery in databases (KDD), data preprocessing includes data
collecting, data cleaning, data selection, and data transformation [4]. Data
collecting is therefore an important part of the KDD process. Its purpose is
to obtain the necessary data from various internal and external sources and
join the data together to create one huge homogeneous dataset. Data
preprocessing [2] may be more time consuming and present more challenges than
data mining itself.

In existing techniques, internal and external data are joined into a single
dataset for mining tasks and play equal roles in that dataset. However,
because collected data may be untrustworthy or even fraudulent, it can
disguise really useful patterns in the data. In particular, if external data
is not preprocessed before it is applied, the patterns identified from it can
expose an application to high risk.

In this paper, breaking away from the traditional data collecting mode that
treats internal and external data equally, we argue that the first step in
utilizing external data is to identify quality data in data-sources for given
mining tasks; we call this the quality data model. Pre- and post-analysis
techniques are thus advocated for generating quality data. Because only
relevant, uncontradictable, and highly trustworthy data-sources are suggested
for mining in our approach, it can not only reduce the search cost but also
generate quality patterns. The approach is particularly useful to
companies/organizations such as nuclear power plants and earthquake bureaus,
which have very small databases but require trustworthy knowledge for their
applications.
The rest of this paper is organized as follows. We begin by stating the
problem in Section 2. In Section 3 we show the effectiveness of identifying
quality data. In Section 4 we illustrate how to identify believable
data-sources by pre-analysis. Section 5 advocates a post-analysis for
identifying believable data-sources. Section 6 designs the algorithm for
ranking data-sources, and Section 7 evaluates the effectiveness of the
proposed framework. We summarize the paper in the last section.



2 Problem Statement

In this section, we formulate the problems in data collecting and outline our approach.



2.1 Problems Faced by Data Collecting 

Traditional data collecting borrows data directly from external data-sources
to form one big dataset for a given mining task. This means that internal and
external data play equally important roles in the mining task.

Indeed, data collecting is necessary for some companies/organizations, such as
nuclear power plants and earthquake bureaus, which have very small databases.
For example, because accidents in nuclear power plants cause environmental
disasters and economic and ecological damage as well as endangering people's
lives, automatic surveillance and early detection of nuclear accidents have
received much attention. To reduce nuclear accidents, we need trustworthy
knowledge for controlling them. However, a nuclear accident database often
contains too little data to form trustworthy patterns. So, mining the accident
database of a nuclear power plant must depend on external data.

Also, a company that has a large database may want to collect external data
to make more profitable decisions. So, employing external data has
become a challenging topic in data mining.

Joining all data from internal and external data-sources directly to
form a single dataset for a mining task has three main limitations.

1. Low-quality (noisy, erroneous, ambiguous, untrustworthy, or fraudulent)
data disguises really useful patterns in the data.

2. It is not made clear which of the collected data-sources are relevant to
the given mining task. In other words, data in irrelevant data-sources plays
an equally important role in the mining task.

3. It is also not confirmed which of the collected data-sources are really
useful to the mining task.

Because of noise and related issues, external data is almost certainly dirty.
Dirty data can disguise really useful patterns and cause applications to fail.
For example, if a stock investor gathers fraudulent information and applies it
directly to his/her investment decisions, he/she may go bankrupt. Hence, it is
very important to select quality external data.

Based on the above analysis, the problem for our research can be formulated 
as follows. 
Given a mining task on dataset DS and n data-sources collected for the
mining task, we are interested in identifying quality data in external data
by pre-analysis and post-analysis.

There are diverse techniques useful for other steps of the KDD process, such
as those in [5][6]. This paper focuses on identifying believable data-sources
among those collected.



2.2 Our Approach 

This paper proposes an approach for identifying believable external
data-sources as the first step in utilizing external data for database
mining. It has two parts.

1. Pre-analysis, which examines the collected data-sources for relevance and
uncontradictability. It is useful when we have no other information about
the data-sources.

2. Post-analysis, which learns about the data-sources from historical data
(a training set).

Our experiments show that our quality data model is effective and promising.

3 Effectiveness of Identifying Quality Data

To show the effectiveness of identifying quality data, an example is used below.
Without loss of generality, we refer to a data-source as a database or a relation.

Consider an internal database (data-source) ID = {(A, B, C); (A, C)} and
six external (collected) databases (data-sources) D1, D2, ..., D6 as follows.

D1 = {(A, B, C, D); (B, C); (A, B, C); (A, C)}
D2 = {(A, B); (A, C); (A, B, C); (B, C); (A, B, D)}
D3 = {(B, C, D); (A, B, C); (B, C); (A, D)}
D4 = {(A, F, G, H, I, J); (E, F, H); (F, H)}
D5 = {(B, E, F, H, J); (F, H); (F, H, J); (E, J)}
D6 = {(C, F, H, I, J); (E, H, J); (E, F, H); (E, I)}

where each database has several transactions, separated by semicolons; each
transaction contains several items, separated by commas.

Let minsupp = 0.5. The local frequent itemsets in D1 are
A, B, C, AB, AC, BC, and ABC, where “XY” means the conjunction of X
and Y. The local frequent itemsets in D2 are A, B, C, and AB.
The local frequent itemsets in D3 are A, B, C, and BC.

The local frequent itemsets in D4 are F, H, and FH. The local
frequent itemsets in D5 are E, F, H, J, EJ, FH, FJ, and
FHJ. The local frequent itemsets in D6 are E, F, H, I, J, EH,
FH, and HJ.

The local frequent itemsets in the internal database ID are A,
B, C, AB, AC, BC, and ABC.
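
As a minimal sketch of how such local frequent itemsets can be found, the following Python snippet enumerates by brute force every itemset that occurs in a small database and keeps those whose support is at least minsupp. The function name `frequent_itemsets` and the brute-force enumeration are illustrative assumptions, since the paper does not specify a mining algorithm. Treating an itemset as frequent when its support is greater than or equal to minsupp is inferred from the example, where itemsets with support exactly 0.5 are listed as frequent.

```python
from itertools import combinations


def frequent_itemsets(db, minsupp):
    """Return {itemset string: support} for itemsets with support >= minsupp.

    db is a list of transactions, each a set of single-letter items.
    Brute force: every non-empty subset of every transaction is a candidate.
    """
    candidates = set()
    for t in db:
        items = sorted(t)
        for r in range(1, len(items) + 1):
            candidates.update(combinations(items, r))
    result = {}
    for c in candidates:
        count = sum(1 for t in db if set(c) <= t)
        support = count / len(db)
        if support >= minsupp:
            result["".join(c)] = support
    return result


# Internal database ID and external database D1 from the example.
ID = [{"A", "B", "C"}, {"A", "C"}]
D1 = [{"A", "B", "C", "D"}, {"B", "C"}, {"A", "B", "C"}, {"A", "C"}]

local_D1 = frequent_itemsets(D1, 0.5)
local_ID = frequent_itemsets(ID, 0.5)
print(sorted(local_D1, key=lambda s: (len(s), s)))
print(sorted(local_ID, key=lambda s: (len(s), s)))
```

Brute-force enumeration is exponential in transaction length, so it is only suitable for toy databases like these; a real system would presumably use a level-wise or pattern-growth miner.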

Let us check whether existing techniques can serve the
purpose of selection, assuming no knowledge about which database contains
interesting information.

1. The first solution (the traditional data collecting technique) is to put
all data from the given database and the six collected databases together to
create a single database TD = ID ∪ D1 ∪ D2 ∪ ... ∪ D6, which has 26
transactions. We now search TD for the above (local) frequent itemsets; the
results are listed in Table 1.



Table 1. The information of local frequent itemsets in the database TD

Itemsets   Frequency   ≥ minsupp
A          12          n
B          12          n
C          13          y
AB         7           n
AC         8           n
BC         9           n
ABC        5           n
E          6           n
F          8           n
H          9           n
I          3           n
J          6           n
EH         4           n
EJ         3           n
FH         8           n
HJ         5           n
FHJ        4           n









There is only one frequent itemset, C, when minsupp = 0.5. To mine the
database TD, we need another minimum support specified by users or experts,
for example, minsupp = 0.115. Then all the itemsets listed in Table 1
are frequent itemsets, and itemsets such as AD, BD, and EF are also frequent
itemsets in TD.

Actually, of the six external databases, only the first three
are likely to be relevant to the internal database; the last three are
unlikely to be relevant. The technique developed in this paper can meet
the requirement of the above application. It is regarded as the second
solution, as follows.

2. The second solution is the quality data model proposed in this paper. The
approach works as follows. Firstly, it selects the believable databases:
class1 = {D1, D2, D3}. Secondly, these databases and the internal database
are put into a single database TD1, which has 15 transactions. Finally, it
mines TD1. In this way, the quality data model is more effective. Table 2
illustrates the effectiveness of identifying quality data in TD1.
Table 2. The information of local frequent itemsets in the database TD1

Itemsets   Frequency   ≥ minsupp
A          11          y
B          11          y
C          12          y
AB         7           n
AC         8           y
BC         9           y
ABC        5           n



With minsupp = 0.5, A, B, C, AC, and BC are frequent itemsets in TD1.
As shown above, the commonly used technique not only increases the search
cost but also disguises the useful patterns, because huge amounts of
irrelevant data are included; the quality data model is significantly more
effective. The following sections explore basic techniques for identifying
quality data.
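
The effect of pooling only the selected databases can be checked with a short sketch. The snippet below is an illustration, not the authors' implementation: the helper name `support` is an assumption, and transactions are written as strings of single-letter items for brevity. It pools ID with D1, D2, and D3 and tests the seven itemsets that are locally frequent in ID against minsupp = 0.5.

```python
def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for t in db if items <= t) / len(db)


# Transactions written as strings of single-letter items.
ID = ["ABC", "AC"]
D1 = ["ABCD", "BC", "ABC", "AC"]
D2 = ["AB", "AC", "ABC", "BC", "ABD"]
D3 = ["BCD", "ABC", "BC", "AD"]

# Pool the internal database with the databases selected as believable.
TD1 = [set(t) for t in ID + D1 + D2 + D3]

minsupp = 0.5
frequent = [s for s in ["A", "B", "C", "AB", "AC", "BC", "ABC"]
            if support(s, TD1) >= minsupp]
print(frequent)
```

The same `support` helper could be applied to the full pool of all seven databases to reproduce the contrast with the traditional technique.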

4 Data-Source Pre-analysis 

As we have seen, the quality data model is aimed at database mining. We
therefore propose to determine which data-sources are trustworthy by pre- and
post-analysis.

For a given data-source DS and the set DSSet of collected data-sources DS1,
DS2, ..., DSm, we first pre-analyze the data-sources in DSSet using their
features and rules when we have no other information about them. The aim is
to select external data-sources that are relevant and uncontradictable to
DS.



4.1 Relevant Data-Sources Selecting 

Let Feature(DSi) be the set of all features¹ in DSi (i = 1, 2, ..., m). We need to
select data-sources from DSSet = {DS1, DS2, ..., DSm} for DS such that each
selected data-source is relevant to DS under a measurement. The features
of data-sources can be used to measure the closeness of a pair of data-sources.
We call the measure sim; it is defined as follows.

1. A function for the similarity between the feature sets of two data-sources
DSi and DSj is defined as follows.

$$sim(DS_i, DS_j) = \frac{|Feature(DS_i) \cap Feature(DS_j)|}{|Feature(DS_i) \cup Feature(DS_j)|}$$

where “∩” denotes set intersection, “∪” denotes set union, and
“|Feature(DSi) ∩ Feature(DSj)|” is the number of elements in the set
Feature(DSi) ∩ Feature(DSj).

¹ The features of a data-source are often selected from its data. If we can only share
the rules (patterns) of the data-source, its features can be selected
from the rules (patterns).
In the above definition of similarity sim : DSSet × DSSet → [0, 1], we take
the size of the intersection of the feature sets of a pair of data-sources to
measure the closeness of the two data-sources. That is, a large intersection
corresponds to a high degree of similarity, whereas two data-sources with a
small intersection are considered to be rather dissimilar.

We now illustrate the use of the above similarity with an example.

Example 1 Let Feature(DS1) = {a1, a2, a3} and Feature(DS2) = {a2, a3, b1, b2}
be the feature sets of two data-sources DS1 and DS2, respectively. The
similarity between DS1 and DS2 is as follows.

$$sim(DS_1, DS_2) = \frac{|Feature(DS_1) \cap Feature(DS_2)|}{|Feature(DS_1) \cup Feature(DS_2)|} = \frac{2}{5} = 0.4$$

Note that sim(DSi, DSj) = 1 only means that Feature(DSi) =
Feature(DSj), that is, DSi and DSj are certainly relevant under the measure sim.
It does not mean that DSi = DSj.

We have proposed a simple and understandable function for measuring the
similarity of pairs of data-sources. Certainly, more similarity functions can
be constructed, for example by using weights on features; this is not the goal
of this paper. Our aim here is only to show how measures of similarity can be
constructed. Using the above similarity on data-sources, we define
data-sources α-relevant to DS below.



Definition 1 A data-source DSi is α-relevant to DS under the measure sim if
sim(DSi, DS) > α, where α (> 0) is a threshold.

For example, let α = 0.4. Consider Feature(DS1) = {i1, i2, i3, i4, i5}
and Feature(DS) = {i1, i3, i4, i5, i6, i7}. Because sim(DS1, DS) = 4/7 = 0.571 > α =
0.4, the data-source DS1 is 0.4-relevant to DS.

Definition 2 Let DSSet be the set of m data-sources DS1, DS2, ..., DSm. The set
of the selected data-sources in DSSet that are α-relevant to a data-source DS
under the similarity measure sim, denoted as RDS(DS, DSSet, sim, α), is
defined as follows:

RDS(DS, DSSet, sim, α) = {ds ∈ DSSet | ds is α-relevant to DS}.
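
The similarity measure and the selection of α-relevant data-sources can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names `sim` and `rds` mirror the paper's notation, feature sets are modelled as Python sets of strings, the treatment of two empty feature sets is a choice made here, and the data-source "DS2" in the usage example is hypothetical.

```python
def sim(features_i, features_j):
    """Similarity |Fi ∩ Fj| / |Fi ∪ Fj| between two feature sets."""
    union = features_i | features_j
    if not union:
        # Assumption: two empty feature sets are treated as dissimilar.
        return 0.0
    return len(features_i & features_j) / len(union)


def rds(ds_features, dsset_features, alpha):
    """Names of the data-sources in dsset_features that are alpha-relevant to DS."""
    return {name for name, feats in dsset_features.items()
            if sim(feats, ds_features) > alpha}


# Feature sets from the example following Definition 1.
DS = {"i1", "i3", "i4", "i5", "i6", "i7"}
collected = {
    "DS1": {"i1", "i2", "i3", "i4", "i5"},
    "DS2": {"i8", "i9"},  # hypothetical unrelated data-source, not from the paper
}

print(round(sim(collected["DS1"], DS), 3))
print(rds(DS, collected, alpha=0.4))
```

Weighted variants of the measure, which the text mentions as possible, would only require replacing the two `len` calls with weighted sums.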



4.2 Uncontradictable Data-Sources Selecting 

Selecting relevant data-sources considers their features. We can also check
for contradiction between pairs of data-sources by comparing their knowledge
when we have no other information about them. Two data-sources DSi
and DSj are contradictive if there is at least one proposition A such that
A holds in DSi and ¬A holds in DSj. A is then called a contradictive proposition
in data-sources DSi and DSj. We use the ratio of contradictive propositions
in DSi and DSj to measure the contradiction between the two
data-sources. We now define a contradiction function contrad below.
Let Rule(DSi) be the set of all propositions in DSi (i = 1, 2, ..., m). We need
to select data-sources from DSSet = {DS1, DS2, ..., DSm} for DS such that
each selected data-source is uncontradictable to DS under a measurement.

2. We can construct the contradiction contrad as the ratio of contradictive
propositions in data-sources DSi and DSj as follows.

$$contrad(DS_i, DS_j) = \frac{\text{number of contradictive propositions in } DS_i \text{ and } DS_j}{|Rule(DS_i) \cup Rule(DS_j)|}$$

In the above definition of contradiction contrad : DSSet × DSSet → [0, 1],
we take the number of contradictive propositions in the data-sources to measure
the contradiction of the two data-sources. That is, a large number of
contradictive propositions corresponds to a high degree of contradiction,
whereas two data-sources with few contradictive propositions are considered to
be rather uncontradictory.

We illustrate the use of the contradiction contrad with an example below.

Example 2 Let Rule(DS1) = {A, B, ¬C, D} and Rule(DS2) = {A, ¬B, C, E, F}
be the sets of propositions of two data-sources DS1 and DS2, respectively. The
contradiction between DS1 and DS2 is measured as follows.

$$contrad(DS_1, DS_2) = \frac{\text{number of contradictive propositions in } DS_1 \text{ and } DS_2}{|Rule(DS_1) \cup Rule(DS_2)|} = \frac{2}{6} = 0.3333$$

Here B and C are the contradictive propositions, and the union is counted over
the six distinct propositions A to F.



Using the above contradiction on data-sources, we define data-sources
β-uncontradictable to DS below.

Definition 3 A data-source DSi is β-uncontradictable to a data-source DSj
under the measure contrad if 1 − contrad(DSi, DSj) > β, where β (> 0) is a
threshold.

For example, let β = 0.8. Consider the data in Example 2. Because 1 −
contrad(DS1, DS2) = 1 − 0.3333 = 0.6667 < β = 0.8, the data-source DS1 is
not 0.8-uncontradictable to DS2.

Definition 4 Let DSSet be the set of m data-sources DS1, DS2, ..., DSm. The
set of the selected data-sources in DSSet that are β-uncontradictable to a
data-source DS under the contradiction measure contrad, denoted as
UDS(DS, DSSet, contrad, β), is defined as follows:

UDS(DS, DSSet, contrad, β) = {ds ∈ DSSet | ds is β-uncontradictable to DS}.
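
The contradiction measure and the selection of β-uncontradictable data-sources can be sketched as follows. The encoding is an assumption made for illustration: a proposition is a string, and a leading "~" marks negation. A second assumption, chosen because it reproduces the denominator 6 of Example 2, is that the union in the denominator is counted over distinct underlying propositions, so that B and ¬B count once. The helper names `atom` and `negate` are hypothetical.

```python
def atom(p):
    """Strip a leading negation mark, e.g. '~C' -> 'C'."""
    return p[1:] if p.startswith("~") else p


def negate(p):
    """Return the negation of a proposition, e.g. 'B' -> '~B', '~C' -> 'C'."""
    return p[1:] if p.startswith("~") else "~" + p


def contrad(rules_i, rules_j):
    """Ratio of contradictive propositions between two rule sets."""
    # A proposition is contradictive if it holds in one set and its negation
    # holds in the other.
    contradictive = {atom(p) for p in rules_i if negate(p) in rules_j}
    # Assumption: the union is counted over distinct underlying propositions.
    atoms = {atom(p) for p in rules_i | rules_j}
    return len(contradictive) / len(atoms) if atoms else 0.0


def uds(ds_rules, dsset_rules, beta):
    """Names of the data-sources that are beta-uncontradictable to DS."""
    return {name for name, rules in dsset_rules.items()
            if 1 - contrad(rules, ds_rules) > beta}


# Rule sets from Example 2.
R1 = {"A", "B", "~C", "D"}
R2 = {"A", "~B", "C", "E", "F"}

c = contrad(R1, R2)
print(round(c, 4))
print(uds(R2, {"DS1": R1}, beta=0.8))
```

With β = 0.8 the selection is empty, in line with the worked example; lowering β to 0.6 would admit DS1.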
Table 3. Past data of using external knowledge (“-” marks a data-source that was not applied)

       DS1   DS2   DS3   DS4   result
a1     -     1     1     1     yes
a2     1     1     1     1     yes
a3     1     1     -     -     no
a4     1     -     1     -     no
a5     1     -     1     -     no
a6     1     -     -     1     yes
a7     -     1     1     1     yes
a8     1     1     1     -     yes
a9     1     1     -     -     no
a10    1     1     1     1     yes



5 Data-Source Post-analysis 

When we have some information such as cases of applying external data-sources
(often a training set), the collected data can be post-analyzed. Suppose we
have applied external data-sources DS1, DS2, DS3, and DS4 in ten real-world
applications, as recorded in Table 3,

where DSi stands for the ith data-source; ai indicates the ith application; “1”
means that the knowledge in a data-source is applied to an application, so
we use DSi = 1 to indicate that the ith data-source is applied to an application;
and “result” measures the success of the applications: “result = yes” means that an
application is successful and “result = no” means that an application fails.
For application a1, three data-sources DS2, DS3, and DS4 have been applied.

Using historical data such as that in Table 3, we can post-analyze the
collected knowledge and determine which data-sources are trustworthy and which
collected patterns are believable. The above instance only elucidates how to use
available information to judge the trustworthiness of a data-source. If we can obtain
more information, we can judge trustworthiness by synthesizing it.

We now advocate a method for computing the trust degrees of data-sources from the
above historical data. The cases of applying the four data-sources DS1, DS2,
DS3, and DS4 in Table 3 are summarized in Table 4.



Table 4. The cases of applying the four data-sources DS1, DS2, DS3, and DS4

       frequency   success   fail   success-ratio
DS1    8           4         4      0.5
DS2    7           5         2      0.714
DS3    7           5         2      0.714
DS4    5           5         0      1
where frequency is the number of applications that use a data-source;
success is the number of successful applications in which the data-source was applied;
fail is the number of failed applications in which the data-source was applied; and
“success-ratio” is success/frequency.

From the above table, DS1 was applied 8 times with success-ratio 0.5, DS2
was applied 7 times with success-ratio 0.714, DS3 was applied 7 times with
success-ratio 0.714, and DS4 was applied 5 times with success-ratio 1.

Certainly, we can use the success-ratios to determine the trust degrees of the
data-sources. One way is to normalize the success-ratios as the trust degrees of
the data-sources, as below.

$$
\begin{aligned}
td_{DS_1} &= \frac{0.5}{0.5 + 0.714 + 0.714 + 1} = 0.171, \\
td_{DS_2} &= \frac{0.714}{0.5 + 0.714 + 0.714 + 1} = 0.244, \\
td_{DS_3} &= \frac{0.714}{0.5 + 0.714 + 0.714 + 1} = 0.244, \\
td_{DS_4} &= \frac{1}{0.5 + 0.714 + 0.714 + 1} = 0.342,
\end{aligned}
$$

where tdDSi stands for the trust degree of the ith data-source (i = 1, 2, 3, 4).

We have seen that data-source DS4 has the highest success-ratio and hence
the highest trust degree; DS1 has the lowest success-ratio and hence the lowest
trust degree.

Furthermore, the trust degree of DSi (i = 1, 2, ..., n) can be defined as
follows.

$$td_{DS_i} = \frac{\text{success-ratio of } DS_i}{\sum_{j=1}^{n} \text{success-ratio of } DS_j}$$
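
The post-analysis can be sketched as a small computation: derive each data-source's frequency and success count from the application history, form the success-ratios, and normalize them so that they sum to one. The snippet below is an illustration under stated assumptions. The `history` list encodes the rows of Table 3 as read so that the counts agree with Table 4, and the function names `success_ratios` and `trust_degrees` are hypothetical. Because it works with exact fractions such as 5/7 rather than the rounded 0.714, the third decimal of a result may differ slightly from a hand calculation.

```python
# Each entry: (set of data-sources applied, whether the application succeeded).
history = [
    ({"DS2", "DS3", "DS4"}, True),          # a1
    ({"DS1", "DS2", "DS3", "DS4"}, True),   # a2
    ({"DS1", "DS2"}, False),                # a3
    ({"DS1", "DS3"}, False),                # a4
    ({"DS1", "DS3"}, False),                # a5
    ({"DS1", "DS4"}, True),                 # a6
    ({"DS2", "DS3", "DS4"}, True),          # a7
    ({"DS1", "DS2", "DS3"}, True),          # a8
    ({"DS1", "DS2"}, False),                # a9
    ({"DS1", "DS2", "DS3", "DS4"}, True),   # a10
]


def success_ratios(history):
    """success / frequency for every data-source that was applied at least once."""
    freq, succ = {}, {}
    for applied, ok in history:
        for ds in applied:
            freq[ds] = freq.get(ds, 0) + 1
            succ[ds] = succ.get(ds, 0) + (1 if ok else 0)
    return {ds: succ[ds] / freq[ds] for ds in freq}


def trust_degrees(ratios):
    """Normalize success-ratios so that the trust degrees sum to 1."""
    total = sum(ratios.values())
    return {ds: r / total for ds, r in ratios.items()}


ratios = success_ratios(history)
td = trust_degrees(ratios)
for ds in sorted(td):
    print(ds, round(ratios[ds], 3), round(td[ds], 3))
```

Normalizing by the sum is only one possible choice, as the text notes; any monotone transformation of the success-ratios would preserve the ranking of the data-sources.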

6 Algorithm Designing 

In our approach, we focus on only three factors when data-sources are ranked:
relevance, uncontradictability, and trustworthiness. Other factors can be
handled similarly. To synthesize the three factors for ranking, we can use the
weighting techniques in [3]. We now design the algorithm for ranking the
external data-sources by pre- and post-analysis as follows.

Algorithm 1 Data-sourcesRank

begin

Input: DS: data-source; DSi: m data-sources;

Output: S: a set of data-sources;

(1) input the collected data-sources DSi relevant to DS;

(2) transform the data in each data-source into rules;

(3) pre-analyze the data-sources DS1, ..., DSm;




602 C. Zhang and S. Zhang 



(4) rank the data-sources in decreasing order by synthesizing the pre-analysis results;

(5) post-analyze the data-sources according to the ranking by pre-analysis;

(6) rank the data-sources in decreasing order by synthesizing the post-analysis results;

(7) let S ← all high-rank data-sources;

(8) output S;

end

The algorithm Data-sourcesRank ranks the collected m data-sources
DS1, DS2, ..., DSm relevant to the data-source DS according to the proposed
framework, where S is the set of all high-rank data-sources.

Step (1) inputs the collected data-sources DS1, DS2, ..., DSm relevant to
the data-source DS. Step (2) transforms the data in the data-sources into rules by
mining, for the purpose of contradiction analysis. Step (3) pre-analyzes the
data-sources using their features and knowledge to select data-sources that are
relevant and uncontradictable to DS. Step (4) first synthesizes the results of
pre-analysis by weighting and then ranks the external data-sources in decreasing
order of the synthesized results. Step (5) generates the trust degrees
of the data-sources selected in Step (4) from historical data. Step (6) ranks
the data-sources in decreasing order by synthesizing the pre-analysis
(relevance and uncontradictability) and post-analysis (trust degree) results.
Step (7) selects all high-rank data-sources and saves them into S. The final
result S is output in Step (8); the data-sources in S are suggested to the
user as believable data-sources.
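
The ranking steps can be illustrated with a short Python sketch. It is not the authors' implementation. It assumes that the relevance, contradiction, and trust-degree values of each data-source have already been computed (so the rule-mining Step (2) is omitted), that the three factors are combined by a plain weighted sum (the weighting technique of [3] is not reproduced here), and that "high-rank" means the top `top_k` entries. The weights, `top_k`, and all scores in the usage example are invented for illustration.

```python
def data_sources_rank(candidates, alpha, beta, weights=(0.3, 0.3, 0.4), top_k=2):
    """Rank data-sources by pre-analysis filtering and a weighted score.

    candidates maps a name to (sim, contrad, trust_degree) with respect to DS.
    """
    w_rel, w_unc, w_trust = weights
    # Steps (3)-(4): keep sources that are alpha-relevant and beta-uncontradictable.
    selected = {name: v for name, v in candidates.items()
                if v[0] > alpha and 1 - v[1] > beta}
    # Steps (5)-(6): combine pre- and post-analysis scores, sort in decreasing order.
    score = {name: w_rel * s + w_unc * (1 - c) + w_trust * td
             for name, (s, c, td) in selected.items()}
    ranked = sorted(score, key=score.get, reverse=True)
    # Steps (7)-(8): return the high-rank data-sources.
    return ranked[:top_k]


# Hypothetical scores, invented for illustration only.
candidates = {
    "DS1": (0.57, 0.00, 0.17),
    "DS2": (0.80, 0.10, 0.24),
    "DS3": (0.20, 0.00, 0.24),  # fails the relevance threshold
    "DS4": (0.60, 0.33, 0.35),  # fails the uncontradictability threshold
}

print(data_sources_rank(candidates, alpha=0.4, beta=0.8))
```

Filtering before scoring mirrors the order of the algorithm: a data-source that fails pre-analysis is never post-analyzed, however high its trust degree.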

7 Experiments 

To evaluate the effectiveness of the proposed framework, we have carried out
some experiments, implemented in Java on a Dell computer. Our experiments are
designed to test the effectiveness of the proposed approach in applications,
with respect to preprocessing by the algorithm Data-sourcesRank, from three
aspects: relevance, uncontradictability, and trustworthiness of external
data-sources, by pre- and post-analysis.

We select 10 data-sources, each of which has a set of rules. Eight of the
data-sources are trustworthy. The other two contain rules that contradict
a given data-source; the contradictory rules always cause applications to
fail, which is used to demonstrate the benefit of the proposed framework.
The parameters of the experimental data-sources are summarized in Table 5,
where DSi is the ith data-source, “size” is the number of rules in a data-source,
and “trusty” is the trustworthiness of a data-source: “yes” indicates
trustworthy, and “no” stands for untrustworthy.

Firstly, we classify the data-sources into three classes C1 = {DS1, DS2,
DS4, DS5, DS6, DS8, DS9, DS10}, C2 = {DS3}, and C3 = {DS7} according
to relevance and contradiction.
Table 5. The experimental data-sources

data-source   size   trusty
DS1           41     yes
DS2           50     yes
DS3           25     no
DS4           34     yes
DS5           57     yes
DS6           23     yes
DS7           44     no
DS8           29     yes
DS9           40     yes
DS10          18     yes



[Figure: success-ratio plotted against the number of data-sources, with one curve for TD-success-ratio and one for NOTD-success-ratio]

Fig. 1. The success ratios of TD and NOTD



Secondly, we use the rules of the data-sources in a class for applications
wherever possible. After several applications, we rank the data-sources in a
table by trustworthiness. The data-sources in class C1 have a high rank and
the data-sources in classes C2 and C3 have a low rank.

Thirdly, the data-sources in class C1 are recommended as trustworthy, and
the rules in these data-sources are then used in applications.

We have done two sets of experiments for four classes of applications of
a data-source DS, where the data-source has 14 rules. In one, called TD, DS
uses only rules from trustworthy data-sources. In the other, called NOTD,
DS randomly borrows external rules from other data-sources. Each class
of applications consists of ten reasoning tasks. The first class of application
needs rules from 2 data-sources. The second class needs rules from 3
data-sources. The third class needs rules from 5 data-sources.
The fourth class needs rules from 6 data-sources. The success
ratios of TD and NOTD are depicted in Figure 1.









In Figure 1, the TD model achieved a 100% success-ratio because (1) the
proposed technique is utilized and (2) the given reasoning tasks can be
completed within class C1. The NOTD model obtained a low and decreasing
success-ratio because the fraudulent rules in DS3 and DS7 are also used in
the tasks.

8 Conclusions 

To our knowledge, little work on identifying believable external data-sources has
been reported in the current literature. However, the efforts on feature selection [1][6]
and data cleaning [7] are closely related to this work.

Feature selection is the process of choosing features which are necessary and
sufficient to represent the data. Data cleaning detects and removes errors,
inconsistencies, contradictions, and redundancies from data and eliminates
irrelevant data in order to improve the quality of the data.

Certainly, when multiple (internal and external) data-sources need to be
integrated for an application, the need for feature selection and data cleaning
increases significantly. However, data in external data-sources may be
untrustworthy or even fraudulent. Because such data may appear relevant to an
application, it can disguise the patterns that are really useful to the
application. In this case, previous data preprocessing methods do not work well.

To make use of discovered patterns, we proposed to pre-analyze and
post-analyze external data-sources so that only quality data is used in mining
tasks. As we have seen, the experimental results show that the proposed
approach can effectively improve the performance of utilizing external
data-sources.

The proposed approach is different from feature selection [1][6] and data
cleaning [7] because (1) we distinguish internal data from external data; (2)
our operating objects may be datasets (data-sources); and (3) untrustworthy and
fraudulent data are eliminated by pre-analysis and post-analysis.
Abstract. The Apriori algorithm’s frequent itemset approach has become
the standard approach to discovering association rules. However, the
computation requirements of the frequent itemset approach are infeasible for
dense data and the approach is unable to discover infrequent associations.
OPUS_AR is an efficient algorithm for association rule discovery that
does not utilize frequent itemsets and hence avoids these problems. It can
reduce search time by using additional constraints on the search space
as well as constraints on itemset frequency. However, the effectiveness
of the pruning rules used during search will determine the efficiency of
its search. This paper presents and analyses pruning rules for use with
OPUS_AR. We demonstrate that application of OPUS_AR is feasible for a
number of datasets for which application of the frequent itemset approach
is infeasible and that the new pruning rules can reduce compute time by
more than 40%.

Keywords: machine learning, search. 



1 Introduction 

Association rule discovery has been dominated by the frequent itemset strategy
as exemplified by the Apriori algorithm [2]. OPUS_AR utilizes an alternative
association rule discovery strategy to find associations without first finding frequent
itemsets [16]. This avoids the need to retain the set of frequent itemsets in
memory, a requirement that makes the frequent itemset strategy infeasible for dense
data [4]. This paper presents and evaluates pruning rules and other strategies that
improve the computational efficiency of OPUS_AR.

We characterize the association rule discovery task as follows. 

— A dataset is a finite set of records where each record is an element to which 
we apply Boolean predicates called conditions. 

— An itemset is a set of conditions. The name itemset derives from association 
rule discovery’s origins in market basket analysis where each condition denotes 
the presence of an item in a market basket. 

— $coverset(I)$ denotes the set of records from a dataset that satisfy itemset $I$. 

— An association rule consists of two conjunctions of conditions called the 
antecedent and consequent and associated statistics describing the frequency with 
which the two co-occur within the dataset. An association rule with antecedent 
$A$, consequent $C$, and statistics $S$ is denoted as $A \rightarrow C\,[S]$. 

The task involves finding all association rules that satisfy a set of user defined 
constraints with respect to a given dataset. 
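
The definitions above can be illustrated with a minimal Python sketch, assuming that each record is stored as the set of conditions it satisfies. The toy data and the names `DATASET` and `coverset` are illustrative assumptions, not part of OPUS_AR.

```python
from typing import FrozenSet, Iterable, List

Record = FrozenSet[str]

# Hypothetical market-basket data: each record is the set of conditions it satisfies.
DATASET: List[Record] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]


def coverset(itemset: Iterable[str], dataset: List[Record]) -> List[Record]:
    """Return the records that satisfy every condition in the itemset."""
    wanted = frozenset(itemset)
    return [record for record in dataset if wanted <= record]


# Records covered by the antecedent {bread}, and records in which the
# antecedent {bread} and the consequent {milk} co-occur.
antecedent_records = coverset({"bread"}, DATASET)
both = coverset({"bread", "milk"}, DATASET)
print(len(antecedent_records), len(both))
```

The statistics attached to a rule are derived from counts of this kind.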
The frequent itemset strategy has become the standard approach to association 
rule discovery. This strategy first discovers all frequent itemsets. A frequent itemset 
is an itemset whose support exceeds a user defined threshold. The association rules 
are then generated from the frequent itemsets. If there are relatively few frequent 
itemsets this approach can be very efficient. However, it is subject to a number of 
limitations. 

1. The user is required to nominate a minimum frequency. Associations with 
support lower than this frequency will not be discovered. For some applications 
there may not be any natural lower bound on support and hence pruning the 
search space on minimum frequency in this manner may not be appropriate. 
Also, for some applications infrequent itemsets may actually be especially 
interesting. For example, especially high value transactions are likely to be both 
relatively infrequent and of high interest. This is known as the vodka and caviar 
problem. 

2. Even when a minimum frequency is applicable, there may be too many frequent 
itemsets for computation to be feasible. The frequent itemset approach requires 
that all frequent itemsets be maintained in memory. This imposes unrealistic 
memory requirements for many applications [4]. 

3. It is difficult to utilize search constraints other than minimum frequency to im- 
prove the efficiency of the frequent itemset approach. Where other constraints 
can be specified, potential efficiencies are lost. 

Most research in association rule discovery has sought to improve the efficiency 
of the frequent itemset discovery process (for example, [1,9]). This work has not 
addressed any of the above problems, with the exception of the closed itemset 
approaches [11,17], which reduce the number of itemsets required and so address 
point 2, but not points 1 or 3. 

OPUS_AR provides an alternative approach to association rule discovery based 
on the efficient OPUS search algorithm [15]. This extends previous work in rule dis- 
covery search [5,8,10,12,13,14,15] by searching for rules that optimize an objective 
function over a space of rules that allows alternative variables in the consequent. 
Previous algorithms have all been restricted to a single target consequent variable 
per search. 

OPUS_AR does not have significant memory requirements other than the re- 
quirement that all data be retained in memory. While it does not achieve the 
same degree of pruning as Apriori from a constraint on minimum frequency, it can 
utilize other constraints more effectively than Apriori. In particular, it can utilize 
constraints on the number of associations to be discovered, returning the n asso- 
ciations that optimize some criterion of interestingness. This provides a desirable 
contrast to the frequent itemset approach that is prone to generate extraordinarily 
large numbers of associations. In practice, only a small number of associations are 
likely to be utilized by a user. A large number of associations is more likely to be 
a hindrance than an asset. 

Search space pruning rules are critical to the efficiency of OPUS_AR. Webb [16] 
utilized four such pruning rules. This paper presents two new pruning rules and 
additional mechanisms for reducing the computational requirements of OPUS_AR. 

This paper is organised as follows. Section 2 introduces the Apriori algorithm 
and analyzes its advantages and disadvantages. Section 3 introduces the OPUS 
search algorithm on which OPUS_AR is based. Section 4 presents the OPUS_AR 
algorithm for discovering association rules. Section 5 describes the new pruning 
rules and other efficiency measures and presents experiments that demonstrate the 
effectiveness of these measures when discovering association rules on several large 
datasets. Section 6 presents conclusions. 

2 The Apriori Algorithm 

The Apriori algorithm discovers associations in a two-step process. First, it finds 
the frequent itemsets $\{I \subseteq \mathcal{C} : |coverset(I)| / |\mathcal{D}| \geq minsupport\}$, 
where $\mathcal{C}$ is the set of all available conditions, $\mathcal{D}$ is the dataset, and 
$minsupport$ is a user defined minimum support constraint. In the second stage the 
frequent itemsets are used to generate the association rules. The minimum support 
constraint on the frequent itemsets guarantees that all associations generated will 
satisfy the minimum support constraint. Other constraints, such as minimum 
confidence, are enforced during the second stage. 

Fig. 1. A fixed-structure search space 

The frequent itemset strategy can limit the number of rules that are explored, 
and cache the support values of the frequent items so that there is no need to access 
the dataset in the second step. It is very successful at reducing the number of passes 
through the data. The frequent itemset approach has become the predominant 
approach to association rule discovery. 
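
The two-step process can be sketched as follows. This is a simplified level-wise illustration written for this discussion, not Borgelt's implementation or the original Apriori code; the function names and the toy data are assumptions. It shows how the second step reads cached support values instead of accessing the dataset again.

```python
from itertools import combinations


def frequent_itemsets(dataset, min_support, max_size):
    """Step 1: level-wise discovery of frequent itemsets and their support."""
    n = len(dataset)
    items = sorted({c for record in dataset for c in record})
    frequent = {}
    level = [frozenset({i}) for i in items]
    size = 1
    while level and size <= max_size:
        # One pass over the data per level to count the candidates.
        current = {}
        for candidate in level:
            supp = sum(1 for record in dataset if candidate <= record) / n
            if supp >= min_support:
                current[candidate] = supp
        frequent.update(current)
        # Join frequent itemsets; keep candidates whose subsets are all frequent.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == size + 1 and all(
                frozenset(sub) in current for sub in combinations(union, size)
            ):
                candidates.add(union)
        level = list(candidates)
        size += 1
    return frequent


def rules_from_itemsets(frequent, min_confidence):
    """Step 2: generate single-consequent rules from cached support values."""
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for consequent in itemset:
            antecedent = itemset - {consequent}
            confidence = supp / frequent[antecedent]  # no data access needed
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, supp, confidence))
    return rules


DATASET = [frozenset("ab"), frozenset("abc"), frozenset("ac"), frozenset("b")]
FREQUENT = frequent_itemsets(DATASET, min_support=0.5, max_size=3)
RULES = rules_from_itemsets(FREQUENT, min_confidence=0.8)
```

Note that `FREQUENT` must hold every frequent itemset at once, which is the memory requirement discussed in the text.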

However, the frequent itemset approach is only feasible for sparse data. For 
dense datasets where there are numerous frequent itemsets, the overheads for main- 
taining and manipulating the itemsets are too large to make the system efficient 
and feasible [4]. This is also apparent in the experiments presented below. Dense 
datasets are common in applications other than basket data analysis or when 
basket data is augmented by other customer information. Another problem of Apriori 
is that it presents numerous association rules to the user, who may find it very 
difficult to identify the interesting rules manually. Take the covtype dataset, 
for example. Covtype has 581,012 records containing 125 items. With minimum 
support set to 0.01, minimum confidence 0.8, and maximum itemset size 5, Apriori 
generates 88,327,710 association rules. Since the Apriori algorithm generates 
itemsets by considering features of itemsets in isolation, the inter-relationships 
between the itemsets are not taken into account. In consequence, many of the 
association rules generated may not be of interest to the user. 
3 The OPUS Search Algorithm 



OPUS [15] provides efficient search for subset selection, such as selecting a subset 
of available conditions that optimizes a specified measure. It was developed for 
classification rule discovery. Previous algorithms ordered the available conditions 
and then conducted a systematic search over the ordering in such a manner as to 
guarantee that each subset was investigated once only, as illustrated in Fig. 1. 
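
A fixed-structure traversal of this kind can be sketched in a few lines of Python. The sketch assumes the ordering a, b, c, d and the convention that a node may only be extended with conditions that precede the one just added; the function name is illustrative.

```python
def enumerate_subsets(current, available):
    """Yield every nonempty extension of `current` exactly once."""
    so_far = []
    for item in available:
        new = current | {item}
        yield new
        # Below `new`, only the items already seen at this level are available.
        yield from enumerate_subsets(new, list(so_far))
        so_far.append(item)


SUBSETS = list(enumerate_subsets(frozenset(), ["a", "b", "c", "d"]))
```

For four conditions the traversal visits each of the 15 nonempty subsets once, starting with {a}, {b}, {a,b}, {c}.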

Critical to the efficiency of such search is the ability to identify and prune 
sections of the search space that cannot contain solutions. This is usually achieved 
by identifying subsets that cannot appear in a solution. For example, it might be 
determined that {b} cannot appear in a solution in the search space illustrated 
in Fig. 1. Under previous search algorithms [8,10,12,13,14], subsets that appear 
below such a subset were pruned, as illustrated in Fig. 2. In this example, pruning 
removes one subset from the search space. 

This contrasts with the pruning that would occur if all subsets containing the 
pruned subset were removed from the search space, as illustrated in Fig. 3. This 
optimal pruning almost halves the search space below the parent node. 



Fig. 2. Pruning a branch from a fixed-structure search space 

Fig. 3. Pruning all nodes containing a single condition from a fixed-structure search space 



OPUS achieves the pruning illustrated in Fig. 3 by maintaining a set of available 
items at each node in the search space. When adding an item $i$ to the current 
subset $s$ results in a subset $s \cup \{i\}$ that can be pruned from the search space, $i$ 
is simply removed from the set of available items at $s$, which is propagated below 
$s$. As supersets of $s \cup \{i\}$ below $s$ can only be explored after $s \cup \{i\}$, this simple 
mechanism with negligible computational overheads guarantees that no superset 
of a pruned subset will be generated in the search space below the parent of 
the pruned node. This greatly expands the scope of a pruning operation from 
that achieved by previous algorithms, which only extended to the space below the 
pruned node. Further pruning can be achieved by reordering the search space, but 
this proves to be infeasible in search for association rule discovery, as explained by 
Webb [16]. 
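
The available-items mechanism can be sketched as follows. This is a minimal illustration, not the OPUS implementation: the `prunable` predicate stands in for whatever test determines that a subset cannot appear in a solution, and all names are assumptions.

```python
def opus_style_search(current, available, prunable, visited):
    """Traverse subsets, dropping a pruned item from the available set."""
    so_far = []
    for item in available:
        new = current | {item}
        if prunable(new):
            # The item never enters so_far, so no superset of `new`
            # is generated anywhere below `current`.
            continue
        visited.append(new)
        opus_style_search(new, list(so_far), prunable, visited)
        so_far.append(item)


VISITED = []
opus_style_search(
    frozenset(), ["a", "b", "c", "d"], lambda s: s == frozenset("b"), VISITED
)
```

With {b} pruned at the root, the traversal visits 7 of the 15 nonempty subsets and none of them contains b, which corresponds to the near-halving of the search space described above.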

4 The OPUS_AR Algorithm 

OPUS_AR extends the OPUS search algorithm to association rule discovery [16]. 
To simplify the search problem, the consequent of an association rule is restricted 
to a single condition. Association rules of this restricted form are of interest for 
many data mining applications. 

Whereas OPUS supports search through spaces of subsets, the 
association rule search task requires search through the space of pairs 
$(I \subseteq conditions, c \in conditions)$, where $I$ is the antecedent and $c$ the consequent 
of an association. OPUS_AR achieves this by performing OPUS search through 
the space of antecedents, maintaining at each node a set of potential consequents, 
each of which is explored at each node. 

The algorithm relies upon there being a set of user defined constraints on the 
acceptable associations. These are used to prune the search space. Such constraints 
can take many forms, ranging from the traditional association rule discovery con- 
straints on support and confidence to a constraint that only the n associations 
that maximize some statistic be returned. To provide a general mechanism for 
handling a wide variety of constraints, we denote associations that satisfy all con- 
straints target associations. Note that it may not be apparent when an association 
is encountered whether or not it is a target. For example, if we are seeking the 100 
associations with the highest lift, we may not know the cutoff value for lift until 
the search has been completed. Hence, while we may be able to determine in some 
circumstances that an association is not a target, we may not be able to determine 
that an association is a target until the search is completed. To accommodate this, 
pruning is only invoked when it is determined that areas of the search space cannot 
contain a target. All associations encountered are recorded unless the system can 
determine that they are not targets. However, these associations may be subse- 
quently discarded as progress through the search space reveals that they cannot 
be targets. When seeking the $n$ best associations with respect to some statistic, we 
can determine that a new association is not a target if its value on that statistic 
is lower than the value of the $n$th best recorded so far, as the value of the $n$th best 
for the search space cannot be lower than the value of the $n$th best for the subset 
of the search space examined so far. 
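
One way to maintain this moving cutoff is a bounded min-heap, sketched below. The class `TopN` and its method names are hypothetical and are not taken from the OPUS_AR implementation; an association whose value ties with the current cutoff displaces an equal-valued one here, a choice the text does not specify.

```python
import heapq


class TopN:
    """Keep the n best (value, rule) pairs seen so far."""

    def __init__(self, n):
        self.n = n
        self._heap = []  # min-heap; the root is the n-th best so far

    def threshold(self):
        """Value below which a new association cannot be a target."""
        if len(self._heap) < self.n:
            return float("-inf")
        return self._heap[0][0]

    def offer(self, value, rule):
        """Record the rule unless it is known not to be a target."""
        if value < self.threshold():
            return False
        if len(self._heap) == self.n:
            # A previously recorded association is discarded.
            heapq.heapreplace(self._heap, (value, rule))
        else:
            heapq.heappush(self._heap, (value, rule))
        return True

    def values(self):
        return sorted(value for value, _ in self._heap)


top = TopN(2)
results = [
    top.offer(1.5, "r1"),
    top.offer(2.0, "r2"),
    top.offer(1.2, "r3"),
    top.offer(3.0, "r4"),
]
```

The threshold can only rise as the search proceeds, which is why an association recorded early may later be discarded.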

Table 1 displays the algorithm that results from applying the OPUS search 
algorithm [15] to obtain efficient search for this search task. The algorithm is 
presented as a recursive procedure with three arguments: 

CurrentLHS: the set of conditions in the antecedent of the rule currently being 
considered. 

AvailableLHS: the set of conditions that may be added to the antecedent of 
rules to be explored below this point. 

AvailableRHS: the set of conditions that may appear on the consequent of a 
rule in the search space at this point and below. 
Table 1. The OPUS search algorithm adjusted for search for association rules 

Algorithm: OPUS_AR(CurrentLHS, AvailableLHS, AvailableRHS) 

1. SoFar := {} 

2. FOR EACH P in AvailableLHS 

   2.1 IF pruning rules cannot determine that $\forall x \subseteq AvailableLHS : \forall y \in AvailableRHS : \neg target(x \cup CurrentLHS \cup \{P\} \rightarrow y)$ THEN 

       2.1.1 NewLHS := CurrentLHS $\cup$ {P} 

       2.1.2 NewAvailableLHS := SoFar - P 

       2.1.3 IF pruning rules cannot determine that $\forall x \subseteq NewAvailableLHS : \forall y \in AvailableRHS : \neg target(x \cup NewLHS \rightarrow y)$ THEN 

             (a) NewAvailableRHS := AvailableRHS - P 

             (b) IF pruning rules cannot determine $\forall y \in NewAvailableRHS : \neg target(NewLHS \rightarrow y)$ THEN 

                 (b.1) FOR EACH Q in NewAvailableRHS 

                       i. IF pruning rules determine that $\forall x \subseteq NewAvailableLHS : \neg target(x \cup NewLHS \rightarrow Q)$ THEN 

                          A. NewAvailableRHS := NewAvailableRHS - Q 

                       ii. ELSE IF pruning rules cannot determine that $\neg target(NewLHS \rightarrow Q)$ THEN 

                          A. IF $target(NewLHS \rightarrow Q)$ THEN 
                             A.1 record $NewLHS \rightarrow Q$ 
                             A.2 tune the settings of the statistics 

                          B. IF pruning rules determine that $\forall x \subseteq NewAvailableLHS : \neg target(x \cup NewLHS \rightarrow Q)$ THEN 
                             NewAvailableRHS := NewAvailableRHS - Q 

             (c) IF NewAvailableLHS $\neq$ {} and NewAvailableRHS $\neq$ {} THEN 
                 OPUS_AR(NewLHS, NewAvailableLHS, NewAvailableRHS) 

             (d) SoFar := SoFar $\cup$ {P} 



The initial call to the procedure sets CurrentLHS to {}, and AvailableLHS and 
AvailableRHS to the set of conditions that are to be considered on the antecedent 
and consequent of association rules, respectively. 

The algorithm OPUS_AR is a search procedure that starts with the associations 
with one condition in the antecedent and searches through successive associations 
formed by adding conditions to the antecedent. It loops through each condition in 
AvailableLHS, adding it to CurrentLHS to form NewLHS. For NewLHS, it 
loops through each condition in AvailableRHS to check if it could be the consequent 
for NewLHS. After the AvailableRHS loop, the procedure is recursively called 
with the arguments NewLHS, NewAvailableLHS and NewAvailableRHS. The two 
latter arguments are formed by removing the pruned conditions from AvailableLHS 
and AvailableRHS, respectively. Step 2.1.3(b.1)ii.A.1 records the potential target 
associations. 
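
The control flow of the procedure can be sketched in Python as below. This is a deliberately reduced illustration under stated assumptions: it applies only one pruning test (an antecedent whose cover falls below the minimum support cannot yield a target, by monotonicity of cover), it uses fixed support and confidence constraints rather than a top-n constraint on lift, and it does not prune consequents. The names `opus_ar_sketch` and `cover` are illustrative.

```python
def cover(itemset, dataset):
    """Fraction of records that contain every condition in the itemset."""
    return sum(1 for record in dataset if itemset <= record) / len(dataset)


def opus_ar_sketch(current_lhs, available_lhs, available_rhs, dataset,
                   min_support, min_confidence, max_lhs_size, found):
    so_far = []
    for p in available_lhs:
        new_lhs = current_lhs | {p}
        lhs_cover = cover(new_lhs, dataset)
        if lhs_cover == 0 or lhs_cover < min_support:
            # Pruned: p is not added to so_far, so no antecedent below
            # current_lhs that contains p is explored later.
            continue
        new_available_lhs = list(so_far)
        new_available_rhs = [q for q in available_rhs if q != p]
        for q in new_available_rhs:
            support = cover(new_lhs | {q}, dataset)
            confidence = support / lhs_cover
            if support >= min_support and confidence >= min_confidence:
                found.append((new_lhs, q, support, confidence))
        if (new_available_lhs and new_available_rhs
                and len(new_lhs) < max_lhs_size):
            opus_ar_sketch(new_lhs, new_available_lhs, new_available_rhs,
                           dataset, min_support, min_confidence,
                           max_lhs_size, found)
        so_far.append(p)


DATASET = [frozenset("ab"), frozenset("abc"), frozenset("ac"), frozenset("b")]
CONDITIONS = ["a", "b", "c"]
FOUND = []
opus_ar_sketch(frozenset(), CONDITIONS, CONDITIONS, DATASET,
               min_support=0.5, min_confidence=0.8, max_lhs_size=2,
               found=FOUND)
```

On this toy dataset the only rule satisfying both constraints has antecedent {c} and consequent a.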



5 Pruning in Search for Association Rules 

Webb [16] utilized four pruning rules to prune the search space explored by 
OPUS_AR. We present two new pruning rules and two data access saving rules for 
improving the efficiency of OPUS_AR. In order to evaluate their impact, 
experiments are performed on five large datasets from the UCI ML and KDD repositories 
[6,3]. These datasets are listed in Table 2. 

The four pruning rules presented in Webb [16] are taken as the basic pruning 
rules in our experiments. Column "basic pruning" of Table 3 lists the times of 
running OPUS_AR with these basic pruning rules on the five datasets. On the 
same datasets we measure the running times of OPUS_AR with the basic pruning plus 
each of the pruning mechanisms introduced below. We also compare the performance with 
the publicly available apriori system developed by Borgelt [7]. In all the 
experiments OPUS_AR seeks the top 1000 associations on lift within the constraints of 
minimum confidence set to 0.8, minimum support set to 0.01, and the maximum 
number of conditions in the antecedent of an association set to 4. The same minimum 
support, minimum confidence, and maximum antecedent size are used for Apriori. 
The maximum itemset size is therefore 5 for Apriori, because itemsets are required that 
contain up to 4 antecedent conditions as well as the single consequent condition. 
The experiments were performed on a Linux server with two 933 MHz CPUs, 
1.5 GB of RAM, and 4 GB of virtual memory. 



5.1 Formal Description of Association Rule Discovery Based on 
OPUS_AR 

A formal description of association rule discovery based on OPUS_AR is given in 
the following. 

Definition 1. An association rule discovery task based on OPUS_AR (abbreviated 
as AR_by_OPUS) is a 4-tuple $(\mathcal{C}, \mathcal{D}, \mathcal{A}, \mathcal{M})$, where 

$\mathcal{C}$: nonempty set of conditions; 

$\mathcal{D}$: nonempty set of records, called the dataset, where for each record $d \in \mathcal{D}$, 
$d \subseteq \mathcal{C}$. For any $S \subseteq \mathcal{C}$, let $coverset(S) = \{d \mid d \in \mathcal{D} \wedge S \subseteq d\}$, and let 
$cover(S) = \frac{|coverset(S)|}{|\mathcal{D}|}$; 

$\mathcal{A}$: set of association rules, where each association rule takes the form 

$$X \rightarrow Y\,[coverage, support, confidence, lift]$$ 

where $X \subseteq \mathcal{C}$, $X \neq \emptyset$, $Y \subseteq \mathcal{C}$, $|Y| = 1$, $X \cap Y = \emptyset$, and 
coverage, support, confidence, and lift are statistics for the association rule, 



Table 2. Datasets for experiments 

name                records  attributes  values 
covtype              581012          55     125 
ipums.la.99           88443          61    1883 
ticdata2000            5822          86     709 
connect-4             67557          43     129 
letter-recognition    20000          17      74 

satisfying $coverage(X \rightarrow Y) = cover(X)$, $support(X \rightarrow Y) = cover(X \cup Y)$, 
$confidence(X \rightarrow Y) = \frac{support(X \rightarrow Y)}{coverage(X \rightarrow Y)}$, and 
$lift(X \rightarrow Y) = \frac{confidence(X \rightarrow Y)}{cover(Y)}$; 

$\mathcal{M}$: constraints, composed of $maxAssocs$ denoting the maximum number of 
target association rules (which will consist of the association rules with the 
highest values for lift of those that satisfy all other constraints), $maxLHSsize$ 
denoting the maximum number of conditions allowed in the antecedent of an 
association rule, $minCoverage$ denoting the minimum coverage, $minSupport$ 
denoting the minimum support, $minConfidence$ denoting the minimum confidence, 
and $minLift = \beta(RS, maxAssocs)$, where $RS$ is the set of associations 
$\{R : coverage(R) \geq minCoverage \wedge support(R) \geq minSupport \wedge 
confidence(R) \geq minConfidence\}$, and $\beta(Z, n)$ is the lift of the $n$th 
association in $Z$ sorted from highest to lowest by lift. An association rule 
$X \rightarrow Y\,[coverage, support, confidence, lift]$ is a target iff it satisfies 
$|X| \leq maxLHSsize$, $coverage(X \rightarrow Y) \geq minCoverage$, 
$support(X \rightarrow Y) \geq minSupport$, $confidence(X \rightarrow Y) \geq minConfidence$, 
and $lift(X \rightarrow Y) \geq minLift$. 
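
The four statistics of Definition 1 can be computed directly from cover, as in the following sketch. The function names and the toy dataset are illustrative assumptions.

```python
def cover(itemset, dataset):
    """cover(S) = |coverset(S)| / |D|."""
    return sum(1 for record in dataset if itemset <= record) / len(dataset)


def rule_statistics(lhs, rhs, dataset):
    """Return (coverage, support, confidence, lift) for the rule lhs -> rhs."""
    coverage = cover(lhs, dataset)
    support = cover(lhs | rhs, dataset)
    confidence = support / coverage
    lift = confidence / cover(rhs, dataset)
    return coverage, support, confidence, lift


DATASET = [frozenset("ab"), frozenset("abc"), frozenset("ac"), frozenset("b")]
STATS = rule_statistics(frozenset("c"), frozenset("a"), DATASET)
```

For the rule {c} -> {a}, every record containing c also contains a, so the confidence is 1 and the lift equals 1 / cover({a}).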

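Theorem 1. Suppose AR_by_OPUS = $(\mathcal{C}, \mathcal{D}, \mathcal{A}, \mathcal{M})$. For any $S_1 \subseteq \mathcal{C}$, $S_2 \subseteq \mathcal{C}$, 
and $S_1 \subseteq S_2$, $coverset(S_2) \subseteq coverset(S_1)$ holds. This is to say, 
$cover(S_2) \leq cover(S_1)$ holds. 

Proof. For any $d \in coverset(S_2)$, according to Definition 1, $S_2 \subseteq d$ holds. Since 
$S_1 \subseteq S_2$, $S_1 \subseteq d$ holds. Hence $d \in coverset(S_1)$. So $coverset(S_2) \subseteq coverset(S_1)$ 
holds. □ 

Theorem 2. Suppose AR_by_OPUS = $(\mathcal{C}, \mathcal{D}, \mathcal{A}, \mathcal{M})$. For any nonempty 
$S_1, S_2, S_3 \subseteq \mathcal{C}$ satisfying $S_1 \cap S_2 = \emptyset$, $S_2 \cap S_3 = \emptyset$, and $S_1 \cap S_3 = \emptyset$, if 

$$cover(S_1) = cover(S_1 \cup S_2) \tag{1}$$ 

the following holds. 

$$cover(S_1 \cup S_3) = cover(S_1 \cup S_2 \cup S_3) \tag{2}$$ 

Proof. From (1) and Definition 1, we have 

$$|coverset(S_1)| = |coverset(S_1 \cup S_2)| \tag{3}$$ 

From Theorem 1, 

$$coverset(S_1) \supseteq coverset(S_1 \cup S_2) \tag{4}$$ 

From (3) and (4), we get 

$$coverset(S_1) = coverset(S_1 \cup S_2) \tag{5}$$ 

For any $d \in \mathcal{D} \wedge S_1 \cup S_3 \subseteq d$, $S_1 \subseteq d$ and $S_3 \subseteq d$ hold. From $S_1 \subseteq d$ and (5), we 
get $S_1 \cup S_2 \subseteq d$. From $S_3 \subseteq d$, $S_1 \cup S_2 \cup S_3 \subseteq d$ holds. Hence 

$$coverset(S_1 \cup S_3) \subseteq coverset(S_1 \cup S_2 \cup S_3) \tag{6}$$ 

From Theorem 1, we have 

$$coverset(S_1 \cup S_2 \cup S_3) \subseteq coverset(S_1 \cup S_3) \tag{7}$$ 

From (6) and (7), $coverset(S_1 \cup S_3) = coverset(S_1 \cup S_2 \cup S_3)$ holds. Hence (2) is 
proved. □ 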
5.2 Pruning the Consequent Condition Before the Evaluation of 
Association Rule 

One of the pruning rules at Step 2.1.3(b.1)i is used to prune the consequent 
condition according to the current lower bound on minSupport before the evaluation 
of the association rule. This pruning rule is based on the following theorem. 

Theorem 3. Suppose AR_by_OPUS = $(\mathcal{C}, \mathcal{D}, \mathcal{A}, \mathcal{M})$. For any association rule 
$X \rightarrow Y$, if $cover(Y) < minSupport$, $X \rightarrow Y$ is not a target. 

Proof. According to Definition 1 and Theorem 1, we get 

$$support(X \rightarrow Y) = cover(X \cup Y) \leq cover(Y) < minSupport$$ 

Hence $X \rightarrow Y$ is not a target. □ 

From this theorem, we get the following pruning rule. 

Pruning 1 In OPUS_AR for AR_by_OPUS = $(\mathcal{C}, \mathcal{D}, \mathcal{A}, \mathcal{M})$, for any condition 
$Q \in AvailableRHS$, if $cover(Q) < minSupport$, then $Q$ can be pruned from 
$NewAvailableRHS$. 

According to Theorem 3, any association rule with such $Q$ as the consequent 
cannot be a target, therefore $Q$ can be pruned. The "pruning 1 added" column of 
Table 3 lists the times for OPUS_AR on the five datasets with the basic pruning 
and pruning 1. 
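
Pruning 1 amounts to a filter over the candidate consequents, as in the following sketch. The function name `prune_consequents` and the toy data are assumptions made for illustration.

```python
def cover(itemset, dataset):
    """Fraction of records that contain every condition in the itemset."""
    return sum(1 for record in dataset if itemset <= record) / len(dataset)


def prune_consequents(available_rhs, dataset, min_support):
    """Drop every consequent Q with cover(Q) < min_support (Theorem 3)."""
    return [
        q for q in available_rhs
        if cover(frozenset({q}), dataset) >= min_support
    ]


DATASET = [frozenset("ab"), frozenset("abc"), frozenset("ac"), frozenset("b")]
KEPT = prune_consequents(["a", "b", "c"], DATASET, min_support=0.6)
```

Here the cover of c is 0.5, below the threshold of 0.6, so no rule with consequent c needs to be evaluated.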

5.3 Pruning the Consequent Condition after the Evaluation of 
Association Rule 

This pruning rule at Step 2.1.3(b.1)ii.B is used to prune the consequent condition 
after the evaluation of the current association rule. It is based on the following 
theorem. 

Theorem 4. Suppose AR_by_OPUS = $(\mathcal{C}, \mathcal{D}, \mathcal{A}, \mathcal{M})$. For any association rule 
$X \rightarrow Y$, if $confidence(X \rightarrow Y) = 1$, for any $X_1 \subseteq \mathcal{C}$ satisfying 
$X_1 \cap X = \emptyset \wedge X_1 \cap Y = \emptyset \wedge cover(X \cup X_1) \neq 0$, the following holds. 

$$lift(X \cup X_1 \rightarrow Y) = lift(X \rightarrow Y)$$ 

Proof. From $confidence(X \rightarrow Y) = 1$, we get 

$$support(X \rightarrow Y) = coverage(X \rightarrow Y)$$ 

that is to say, 

$$cover(X) = cover(X \cup Y) \tag{8}$$ 

From (8) and Theorem 2, $cover(X \cup X_1) = cover(X \cup X_1 \cup Y)$ holds. Since 
$cover(X \cup X_1) \neq 0$, 

$$support(X \cup X_1 \rightarrow Y) = coverage(X \cup X_1 \rightarrow Y) \neq 0$$ 

Therefore $confidence(X \cup X_1 \rightarrow Y) = 1$. From Definition 1, the following two 
equations hold. 

$$lift(X \cup X_1 \rightarrow Y) = \frac{confidence(X \cup X_1 \rightarrow Y)}{cover(Y)} = \frac{1}{cover(Y)}$$ 

$$lift(X \rightarrow Y) = \frac{confidence(X \rightarrow Y)}{cover(Y)} = \frac{1}{cover(Y)}$$ 

Hence $lift(X \cup X_1 \rightarrow Y) = lift(X \rightarrow Y)$ holds. □ 

From this theorem, we get the following pruning rule. 

Pruning 2 In OPUS_AR for AR_by_OPUS = $(\mathcal{C}, \mathcal{D}, \mathcal{A}, \mathcal{M})$, after the evaluation 
of the current association rule $NewLHS \rightarrow Q$, if $confidence(NewLHS \rightarrow Q) = 1$ 
and $lift(NewLHS \rightarrow Q) < minLift$, $Q$ can be pruned from $NewAvailableRHS$. 

According to the above theorem, all of the association rules with $Q$ as the 
consequent in the search space below the current node take the same lift value 
as $NewLHS \rightarrow Q$. Therefore, if $lift(NewLHS \rightarrow Q) < minLift$, none of these 
rules can be a target association, and $Q$ can be pruned from $NewAvailableRHS$. The 
"pruning 2 added" column of Table 3 lists the times for OPUS_AR on the five 
datasets with the basic pruning and pruning 2. For "covtype," the compute time 
is reduced by this pruning to less than 55% of that supported by the basic pruning 
rules. For "ipums.la.99," the compute time is reduced to less than 66% of the basic 
pruning. 
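
Theorem 4 and the resulting test can be checked on a small example. The sketch below uses a toy dataset and hypothetical function names; an implementation working from record counts would compare counts rather than floating-point confidences.

```python
import math


def cover(itemset, dataset):
    """Fraction of records that contain every condition in the itemset."""
    return sum(1 for record in dataset if itemset <= record) / len(dataset)


def lift(lhs, rhs, dataset):
    confidence = cover(lhs | rhs, dataset) / cover(lhs, dataset)
    return confidence / cover(rhs, dataset)


def can_prune_consequent(confidence, lift_value, min_lift):
    """Pruning 2: with confidence 1, lift cannot change below this node."""
    return confidence == 1.0 and lift_value < min_lift


DATASET = [frozenset("ab"), frozenset("abc"), frozenset("ac"), frozenset("b")]
X, Y, X1 = frozenset("c"), frozenset("a"), frozenset("b")

# confidence({c} -> {a}) = 1, and cover({b, c}) is nonzero.
BASE_LIFT = lift(X, Y, DATASET)
EXTENDED_LIFT = lift(X | X1, Y, DATASET)
```

Both lifts equal 1 / cover({a}), so if the first falls short of the current minLift the consequent can be dropped for the whole branch.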



5.4 Saving Data Access for the Current Association Rule by 
minConfidence 

In order to evaluate the number of records covered by a set of conditions, the dataset 
is normally accessed by OPUS_AR at least once for each association rule antecedent 
and once for the union of the antecedent and consequent. Techniques for saving 
such data access can improve the efficiency of the algorithm. Whereas the pruning 
rules save data access by discarding the region of the search space below a node, 
the saving rules save data access for a node without removing its branch. 

Step 2.1.3(b.1)ii is for saving data access for the current association rule 
$NewLHS \rightarrow Q$. We introduce two of the saving rules adopted at 
this step: one is by minConfidence, based on the following theorem, and the 
other is by the antecedent of the current association rule, described in the next 
section. 

Theorem 5. Suppose AR_by_OPUS = $(\mathcal{C}, \mathcal{D}, \mathcal{A}, \mathcal{M})$. For any association rule 
$X \rightarrow Y$, if $\frac{cover(Y)}{cover(X)} < minConfidence$, $X \rightarrow Y$ is not a target. 

Proof. According to Definition 1, we have 

$$confidence(X \rightarrow Y) = \frac{support(X \rightarrow Y)}{coverage(X \rightarrow Y)} = \frac{cover(X \cup Y)}{cover(X)}$$ 

According to Theorem 1, $cover(X \cup Y) \leq cover(Y)$ holds. Since 
$\frac{cover(Y)}{cover(X)} < minConfidence$, 

$$confidence(X \rightarrow Y) \leq \frac{cover(Y)}{cover(X)} < minConfidence$$ 

Therefore $X \rightarrow Y$ is not a target. □ 

From this theorem, we get the following data access saving rule. 

Saving 1 In OPUS_AR for AR_by_OPUS = $(\mathcal{C}, \mathcal{D}, \mathcal{A}, \mathcal{M})$, for the current 
association $NewLHS \rightarrow Q$, if $|NewLHS| = maxLHSsize$ and 
$\frac{cover(Q)}{cover(NewLHS)} < minConfidence$, there is no need to access data to evaluate 
$NewLHS \rightarrow Q$, as it is not a target. 

Saving, rather than pruning, is adopted in this situation because, in the branch 
below the current $NewLHS \rightarrow Q$, some supersets of $NewLHS$ with lower coverage 
might give the association a confidence larger than minConfidence. When data 
access is saved, however, the pruning that relies on the results of that data access 
is no longer available, so overall efficiency might suffer. For this reason, the 
condition $|NewLHS| = maxLHSsize$ is added to the above saving rule to ensure 
that it is applied only at the maximum search depth, where no pruning is necessary. 

The "saving 1 added" column of Table 3 lists the times for OPUS_AR on the 
five datasets with the basic pruning and this saving rule. 
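
The test in Saving 1 needs no additional pass over the data if both cover values are already at hand. The sketch below assumes that is the case (the cover of the antecedent from creating the node, and the cover of each single condition cached at start-up); that assumption and the function name are illustrative, not taken from the OPUS_AR implementation.

```python
def can_skip_evaluation(lhs_size, max_lhs_size, cover_lhs, cover_q,
                        min_confidence):
    """Saving 1: True if NewLHS -> Q is known not to be a target.

    Applies only at the maximum search depth, and relies on the bound
    confidence(NewLHS -> Q) <= cover(Q) / cover(NewLHS) from Theorem 5.
    """
    if lhs_size != max_lhs_size:
        return False
    return cover_q / cover_lhs < min_confidence


AT_DEPTH = can_skip_evaluation(4, 4, cover_lhs=0.5, cover_q=0.3,
                               min_confidence=0.8)
ABOVE_DEPTH = can_skip_evaluation(3, 4, cover_lhs=0.5, cover_q=0.3,
                                  min_confidence=0.8)
```

In the first call the confidence is bounded by 0.6, so the rule cannot reach a minimum confidence of 0.8; in the second the rule is not at the maximum depth and is evaluated normally.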

5.5 Saving Data Access for the Current Association Rule by the 
Antecedent 

Another saving rule at Step 2.1.3(b.1)ii for the current association rule 
$NewLHS \rightarrow Q$, where $NewLHS = CurrentLHS \cup \{P\}$, $P \in AvailableLHS$, 
functions according to the relation between $CurrentLHS$ and $P$. It is based on 
the following theorem. 

Theorem 6. Suppose AR_by_OPUS = $(\mathcal{C}, \mathcal{D}, \mathcal{A}, \mathcal{M})$. For any association rules 
$X \rightarrow Y$ and $X \cup \{P\} \rightarrow Y$ where $P \in \mathcal{C}$, $P \notin X$ and $P \notin Y$, if 
$cover(X) = cover(X \cup \{P\})$, the following hold. 

$$coverage(X \rightarrow Y) = coverage(X \cup \{P\} \rightarrow Y) \tag{9}$$ 

$$support(X \rightarrow Y) = support(X \cup \{P\} \rightarrow Y) \tag{10}$$ 

$$confidence(X \rightarrow Y) = confidence(X \cup \{P\} \rightarrow Y) \tag{11}$$ 

$$lift(X \rightarrow Y) = lift(X \cup \{P\} \rightarrow Y) \tag{12}$$ 

Proof. According to $cover(X) = cover(X \cup \{P\})$, (9) holds. From 
$cover(X) = cover(X \cup \{P\})$ and Theorem 2, the following holds. 

$$cover(X \cup Y) = cover(X \cup \{P\} \cup Y) \tag{13}$$ 

From (13), (10) holds. Hence (11) and (12) are proved. □ 

From this theorem, we get the following data access saving rule. 

Saving 2 In OPUS_AR for AR_by_OPUS = $(\mathcal{C}, \mathcal{D}, \mathcal{A}, \mathcal{M})$, for the current 
association $NewLHS \rightarrow Q$ where $NewLHS = CurrentLHS \cup \{P\}$, 
$P \in AvailableLHS$, if $|NewLHS| = maxLHSsize$, the number of current target 
associations is less than $|coverset(NewLHS)|$, and 
$cover(CurrentLHS) = cover(NewLHS)$, then instead of accessing data to evaluate 
$NewLHS \rightarrow Q$, check whether $CurrentLHS \rightarrow Q$ exists in the current target 
associations. If it does, copy all the statistic values of $CurrentLHS \rightarrow Q$ to 
$NewLHS \rightarrow Q$; otherwise, $NewLHS \rightarrow Q$ is not a target. 

Since $CurrentLHS \rightarrow Q$ is investigated before $NewLHS \rightarrow Q$ in OPUS_AR, 
and they share the same statistic values, $NewLHS \rightarrow Q$ will be a target if and 
only if $CurrentLHS \rightarrow Q$ is a target. For the same reasons as in the above 
section, we add $|NewLHS| = maxLHSsize$ to the saving rule to make sure that 
its application cannot reduce overall efficiency. If the number of current target 
associations is larger than $|coverset(NewLHS)|$, searching the current target 
associations might become less efficient than accessing $|coverset(NewLHS)|$ 
records to compute $cover(NewLHS \cup \{Q\})$. 

The "saving 2 added" column of Table 3 lists the times for OPUS_AR on the 
five datasets with the basic pruning and this saving rule. For both "covtype" and 
"connect-4," the compute times are reduced by this saving to less than 66% of 
that supported by the basic pruning rules. 
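
The decision made by Saving 2 can be sketched as a lookup that replaces a data access. The representation of the target associations as a dictionary keyed by (antecedent, consequent), the `evaluate` callback, and the function name are all assumptions made for this illustration.

```python
def evaluate_with_saving_2(current_lhs, p, q, targets, cover_current,
                           cover_new, coverset_size_new, max_lhs_size,
                           evaluate):
    """Return the statistics of NewLHS -> Q, or None if it is not a target.

    `targets` maps (antecedent, consequent) to recorded statistics.
    `evaluate` is the normal path that accesses the data.
    """
    new_lhs = current_lhs | {p}
    applicable = (
        len(new_lhs) == max_lhs_size
        and len(targets) < coverset_size_new
        and cover_current == cover_new
    )
    if not applicable:
        return evaluate(new_lhs, q)
    # Theorem 6: the statistics of CurrentLHS -> Q carry over unchanged.
    return targets.get((current_lhs, q))


CALLS = []


def fake_evaluate(lhs, q):
    """Stand-in for a data access; records that it was invoked."""
    CALLS.append((lhs, q))
    return (0.4, 0.4, 1.0, 2.0)


TARGETS = {(frozenset("a"), "c"): (0.5, 0.5, 1.0, 1.25)}
COPIED = evaluate_with_saving_2(frozenset("a"), "b", "c", TARGETS,
                                0.5, 0.5, 10, 2, fake_evaluate)
NOT_TARGET = evaluate_with_saving_2(frozenset("a"), "b", "d", TARGETS,
                                    0.5, 0.5, 10, 2, fake_evaluate)
```

Neither call above touches the data: the first copies the recorded statistics and the second concludes that the rule is not a target.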

Table 3. Efficiency improvements by pruning in OPUS_AR and efficiency of Apriori (times as h:m:s) 

                    OPUS_AR                                                           Apriori 
datasets            basic      pruning 1  pruning 2  saving 1   saving 2   all 
                    pruning    added      added      added      added      added 
covtype             7:33:50    5:28:21    4:6:58     6:25:16    4:59:19    3:4:19     77:56:3 
ipums.la.99         11:38:31   9:2:37     7:40:12    11:27:9    9:25:16    6:28:38    19:45:5 
ticdata2000         25:28:43   24:34:12   23:41:12   24:56:10   22:34:7    23:18:29   — 
connect-4           1:48:51    1:24:59    1:10:9     1:30:35    1:11:37    0:48:33    3:15:26 
letter-recognition  0:0:23     0:0:20     0:0:20     0:0:23     0:0:22     0:0:20     0:0:35 

5.6 Efficiency Comparison between OPUS_AR and Apriori 

The "all added" column of Table 3 lists the times on the datasets for OPUS_AR 
with the new pruning mechanisms, composed of pruning 1 and 2 and saving rules 
1 and 2, all added to the four original pruning rules. For all datasets other than 
"ticdata2000," combining all rules results in more efficient search than utilizing 
any one of the rules alone. The interaction between rules that increases compute times 
for "ticdata2000" merits further investigation. For "covtype," "connect-4," and 
"ipums.la.99," the compute times are reduced to less than 41%, 45% and 56% of 
that supported by the original pruning rules, respectively. 

The CPU times of running Borgelt's Apriori system on the five datasets are 
listed in the "Apriori" column of Table 3. The inefficiency of Apriori for dense 
datasets is demonstrated by the fact that on every dataset OPUS_AR is more 
efficient than Apriori, and that for "ticdata2000," Apriori runs out of memory 
when processing itemsets of size 4. 
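
These percentages can be reproduced from Table 3 with a short calculation. The sketch assumes the table entries are hours:minutes:seconds; the helper name is illustrative.

```python
def to_seconds(hms):
    """Convert an 'h:m:s' string, as printed in Table 3, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# (basic pruning, all added) times copied from Table 3.
TIMES = {
    "covtype": ("7:33:50", "3:4:19"),
    "connect-4": ("1:48:51", "0:48:33"),
    "ipums.la.99": ("11:38:31", "6:28:38"),
}

RATIOS = {
    name: to_seconds(all_added) / to_seconds(basic)
    for name, (basic, all_added) in TIMES.items()
}

for name, ratio in RATIOS.items():
    print(f"{name}: {ratio:.1%} of the basic pruning time")
```

The ratios come out just under 41%, 45% and 56%, respectively.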





Further Pruning for Efficient Association Rule Discovery 617 



6 Conclusions 

OPUS_AR provides an alternative to the frequent itemset approach to association 
rule discovery. Our experiments have demonstrated that OPUS_AR can provide 
more efficient association rule discovery than Apriori for dense datasets, and can 
make association rule discovery feasible where the memory requirements of the 
frequent itemset approach can make its application infeasible. OPUS_AR has the 
further advantage that it can utilize constraints other than minimum frequency to 
prune the search space. This makes association rule discovery feasible where there 
is no natural lower limit on the support for an association. 

This paper has presented new pruning rules and data access saving rules for 
OPUS_AR, which reduce compute times to as little as 41% of those resulting 
from the original mechanisms only. These results 
again demonstrate that OPUS_AR can support fast association rule discovery from 
large dense datasets. 

References 

1. R. Agarwal, C. C. Aggarwal, and V. V. V. Prasad. Depth first generation of long 
patterns. In Proc. Sixth ACM SIGKDD Int. Conf. Knowledge Discovery and Data 
Mining (KDD2000), pages 108-118, Boston, MA, August 2000. ACM. 

2. R. Agrawal, T. Imielinski, and A. Swami. Mining associations between sets of items 
in massive databases. In Proc. 1993 ACM-SIGMOD Int. Conf. Management of Data, 
pages 207-216, 1993. 

3. S. D. Bay. The UCI KDD archive, [http://kdd.ics.uci.edu] Irvine, CA: University of 
California, Department of Information and Computer Science., 2001. 

4. R. J. Bayardo. Efficiently mining long patterns from databases. In Proc. 1998 ACM- 
SIGMOD Int. Conf. Management of Data, pages 85-93, 1998. 

5. R. J. Bayardo, R. Agrawal, and D. Gunopulos. Constraint-based rule mining in large, 
dense databases. Data Mining and Knowledge Discovery, 4(2/3):217-240, 2000. 

6. C. Blake and C. J. Merz. UCI repository of machine learning databases. [Machine- 
readable data repository] . University of California, Department of Information and 
Computer Science, Irvine, CA., 2001. 

7. C. Borgelt. apriori. (Computer Software) 
http://fuzzy.cs.Uni-Magdeburg.de/ borgelt/, February 2000. 

8. S. H. Clearwater and F. J. Provost. RL4: A tool for knowledge-based induction. 
In Proc. Second Inti. IEEE Conf. on Tools for AI, pages 24-30, Los Alamitos, CA, 
1990. IEEE Computer Society Press. 

9. J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. 
In Proc. 2000 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'00), 
Dallas, TX, May 2000. 

10. S. Morishita and A. Nakaya. Parallel branch-and-bound graph search for correlated 
association rules. In Proc. ACM SIGKDD Workshop on Large-Scale Parallel KDD 
Systems, volume LNAI 1759, pages 127-144. Springer, Berlin, 2000. 

11. J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent 
closed itemsets. In Proc. 2000 ACM-SIGMOD Int. Workshop on Data Mining and 
Knowledge Discovery (DMKD'00), Dallas, TX, May 2000. 

12. F. Provost, J. Aronis, and B. Buchanan. Rule-space search for knowledge-based 
discovery. CHO Working Paper IS 99-012, Stern School of Business, New York 
University, New York, NY 10012, 1999. 







13. R. Rymon. Search through systematic set enumeration. In Proc. KR-92, pages 
268-275, Cambridge, MA, 1992. 

14. R. Segal and O. Etzioni. Learning decision lists using homogeneous rules. In AAAI- 
94, Seattle, WA, 1994. AAAI press. 

15. G. I. Webb. OPUS: An efficient admissible algorithm for unordered search. Journal 
of Artificial Intelligence Research, 3:431-465, 1995. 

16. G. I. Webb. Efficient search for association rules. In The Sixth ACM SIGKDD 
Int. Conf. Knowledge Discovery and Data Mining, pages 99-107, Boston, MA, 2000. 
The Association for Computing Machinery. 

17. M. J. Zaki. Generating non-redundant association rules. In Proceedings of the Sixth 
ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining (KDD2000), pages 
34-43, Boston, MA, August 2000. ACM. 




Pattern Discovery in Probabilistic Databases 



Shichao Zhang and Chengqi Zhang 



School of Computing and Mathematics 
Deakin University, Geelong, Vic 3217, Australia 
{scz, chengqi}@deakin.edu.au 



Abstract. Modeling probabilistic data is one of the important issues in 
databases, because data is often uncertain in real-world applications. 
It is therefore necessary to identify potentially useful patterns in 
probabilistic databases. Because probabilistic data in 1NF relations is 
redundant, previous mining techniques do not work well on probabilistic 
databases. For this reason, this paper proposes a new model for mining 
probabilistic databases. A partitioning method is developed for 
preprocessing the probabilistic data in a probabilistic database. We 
evaluated the proposed technique, and the experimental results 
demonstrate that our approach is effective and efficient. 



1 Introduction 

Association analysis for large databases has received much attention recently [1]. 
There has also been much work on mining special databases, for example 
spatial data mining [5] and image data mining [2]. 

However, there is no work on mining probabilistic databases, even though today's 
database systems must handle uncertainties in the data they store. Such 
uncertainties arise from different sources such as measurement errors, approximation 
errors, and the dynamic nature of the real world. For example, in an image retrieval 
system, an image processing algorithm may fetch images that are similar to a 
given sample image and feed the results into a relational database. The 
results are generally uncertain. Because probabilistic data in 1NF (First Normal 
Form) relations is redundant, traditional mining techniques do not work well on 
probabilistic databases. For this reason, this paper establishes a new mining model 
for probabilistic databases, adopting the probabilistic data model in [3]. 
A dependent rule is identified in a probabilistic database and represented 
in the form $X \rightarrow Y$ with a conditional probability matrix $M_{Y|X}$. We now 
illustrate the above argument with the following example. 

Example 1. Consider a probabilistic personnel database in some university. The 
data of interest is the set of records over the attributes “education”, “salary” 
and “pS” of employees, as shown in Table 1. 



M. Brooks, D. Corbett, and M. Stumptner (Eds.): AI 2001, LNAI 2256, pp. 619-630, 2001. 

Table 1. A probabilistic relation 

EMP#   Education   salary   pS 
3025   Doctor      4100     0.8 
3025   Doctor      2500     0.1 
3025   Doctor      1800     0.1 
6637   Doctor      3500     0.14 
6637   Doctor      2400     0.06 
6637   Master      3500     0.1 
6637   Master      2400     0.6 
6637   Master      1800     0.1 
7741   Bachelor    3500     0.1 
7741   Bachelor    2400     0.1 
7741   Bachelor    1500     0.8 

Let us examine whether any of the existing techniques work well 
on the above table. 

1. The first solution (item-based technique) is to identify association rules such 
as “3025 $\rightarrow$ Doctor” and “7741 $\rightarrow$ Bachelor”. 

The above association rules are uninteresting in the probabilistic database. 
In other words, the item-based technique cannot work well on the above table. 

To mine such a database, we first need to partition the domain of Education 
into Doctor, Master, and UnderMaster, and the domain of Salary into 
$[3500, +\infty)$, $[2100, 3500)$ and $[0, 2100)$. Doctor, Master, UnderMaster, 
$[3500, +\infty)$, $[2100, 3500)$ and $[0, 2100)$ are called quantitative items. 
Second, we compute the probabilities of the quantitative items in the database. 
Let X and Y stand for Education and Salary, respectively, and let 
$\tau_1, \tau_2, \cdots, \tau_{11}$ be the tuples of Table 1 in order. 
Then for EMP# 3025, 

$$p(X = \text{Doctor}) = \tau_1(pS) + \tau_2(pS) + \tau_3(pS) = 0.8 + 0.1 + 0.1 = 1,$$ 

$$p(X = \text{Master}) = 0, \quad p(X = \text{UnderMaster}) = 0, \quad p(Y = [3500, +\infty)) = 0.8,$$ 
$$p(Y = [2100, 3500)) = 0.1, \quad p(Y = [0, 2100)) = 0.1.$$ 

2. The second solution (quantitative-item-based technique) is to identify 
association rules such as “Education = Doctor $\rightarrow$ Salary $\geq 3500$”, 
“Education = Master $\rightarrow$ Salary $\in [2100, 3500)$”, and 
“Education = UnderMaster $\rightarrow$ Salary $< 2100$”, using the techniques 
proposed in this paper. 

The quantitative-item-based technique certainly works better than the item-based 
technique on this table. However, the three rules express only a part of the 
relationships between the attributes “Education” (X) and “Salary” (Y). An ideal 
approach is advocated in this paper. 

3. Our approach (the third solution) applies a conditional probability matrix 
$M_{Y|X}$ for $X \rightarrow Y$ to fit the probabilistic data. If the data are fitted 
by a conditional probability matrix $M_{Y|X}$, then the dependency between X 
and Y can be described by this matrix. A main goal of this paper is to 
build a model to learn these probabilities. Using the algorithm in Section 4, 
we can acquire a conditional probability matrix $M_{Y|X}$ from the above data 
as follows. 

$$M_{Y|X} = \begin{bmatrix} p_{11} & p_{12} & p_{13} \\ p_{21} & p_{22} & p_{23} \\ p_{31} & p_{32} & p_{33} \end{bmatrix} = \begin{bmatrix} 0.8 & 0.1 & 0.1 \\ 0.1 & 0.8 & 0.1 \\ 0.1 & 0.1 & 0.8 \end{bmatrix}$$ 



As we have seen, dependent rules can capture the relationship between 
pairs of multi-valued variables, and mining them is more challenging than 
mining quantitative-item-based rules. 

The rest of this paper is organized as follows. Section 2 presents some needed 
concepts. In Section 3, a partitioning is proposed. Section 4 builds a statistical 
model of mining probabilistic databases using a partition. The experiments are 
illustrated in Section 5. In the last section, a summary of this paper is presented. 



2 Basic Definition 

Assume $I$ is a set of items in database $D$. A subset of items of the same type 
in $I$ is referred to as a quantitative item. 

An item-based association rule is a relationship of the form $A \Rightarrow B$, where 
$A$ and $B$ are itemsets and $A \cap B = \emptyset$. It has both support and confidence 
greater than or equal to some user specified minimum support (minsupp) and minimum 
confidence (minconf) thresholds, respectively. 

A quantitative association rule is a relationship of the form 

$$(\text{attribute1}, \text{value1}) \Rightarrow (\text{attribute2}, \text{value2}),$$ 

where attribute1 and attribute2 are attributes, value1 and value2 are subsets of 
the domains of attribute1 and attribute2 respectively, and (attribute1, value1) and 
(attribute2, value2) are quantitative items. 

A dependent rule is a relationship between X and Y of the form $X \rightarrow Y$ with 
a conditional probability matrix $M_{Y|X}$ [6], where X and Y are variables taking 
values in ranges $R(X)$ and $R(Y)$ respectively, and $x \in R(X)$ is called a 
point-value of X, where x is a quantitative item. $M_{Y|X}$ is given as 

$$M_{Y|X} \triangleq [p(y_j|x_i)]_{m \times n},$$ 

where “$\triangleq$” denotes definition, and $p(y_j|x_i) = p(Y = y_j|X = x_i)$ are 
conditional probabilities, $i = 1, 2, \cdots, m$, $j = 1, 2, \cdots, n$. 

The problem of mining dependent rules is to generate all rules $X \rightarrow Y$, 
together with their matrices $M_{Y|X}$, whose support is greater than or 
equal to some user specified minimum support (minsupp) threshold. 

3 Data Partitioning 

In data mining, there are two main models for partitioning data: the knowledge-based 
partitioning model [4] and the equi-depth partitioning model [7]. We propose 
a so-called “good partition” to generate quantitative items and item variables 
for a given database, which decomposes the “bad quantitative items” and “bad 
item variables” and composes the “not-bad quantitative items” and “not-bad 
item variables”. 

Generally, a quantitative item does not occur in the transactions of a database. 
To find quantitative association rules from databases, we say that a quantitative 
item i is contained in a transaction t of a database D if at least one 
element of i occurs in t (note that each quantitative item consists of multiple 
simpler items). The support of the quantitative item i is s if 100 * s% 
of the transactions in D contain at least one element of i, that is, 

$$s = |i(t)| / |D|,$$ 

where $i(t) = \{t \in D \mid t \text{ contains at least one element of } i\}$. 

In this way, we can map the quantitative association rule problem into the Boolean 
association rule problem [7], and item-based mining techniques and algorithms 
can also be used to identify quantitative association rules. 
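
The support definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `quantitative_item_support` and the toy transactions are hypothetical and do not come from the paper.

```python
def quantitative_item_support(transactions, quantitative_item):
    """Fraction of transactions that contain at least one element of the item."""
    if not transactions:
        return 0.0
    item = set(quantitative_item)
    # i(t): the transactions that share at least one element with the item.
    covered = [t for t in transactions if item & set(t)]
    return len(covered) / len(transactions)


# Hypothetical toy database: each transaction is a set of simple items.
transactions = [
    {"Doctor", 4100},
    {"Master", 2400},
    {"Master", 3000},
    {"Bachelor", 1500},
]
# The quantitative item [2100, 3500), written as the simple items it covers here.
mid_salary = {2400, 2500, 3000}

print(quantitative_item_support(transactions, mid_salary))  # 0.5
```

Treating a quantitative item as a set of simple items is what lets a Boolean association rule miner be reused unchanged.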



Quantitative Items. In previous sections, we partitioned the items in the domains 
of Education and Salary into {Doctor, Master, UnderMaster} and 
{$[3500, +\infty)$, $[2100, 3500)$, $[0, 2100)$} respectively. 

According to different requirements in applications, we can divide them into 
different sets of quantitative items. For example, we can partition R(Salary) 
as {$[7200, +\infty)$, $[3500, 7200)$, $[2100, 3500)$, $[0, 2100)$} or as 
{$[3500, +\infty)$, $[2100, 3500)$, $[0, 2100)$}. However, a reasonable partition 
also needs to consider the supports of items and their degree of association 
with other items. We apply decomposition and composition of quantitative items 
to generate a good partition. 

Clearly, bad quantitative items will not contribute to quantitative association 
rules, and not-bad quantitative items should also be avoided unless they are 
required. We now present the algorithm to decompose bad quantitative 
items and compose not-bad quantitative items as follows. 

Procedure 1 DecComposeQI 

begin 
  Input: I: set of all items, QI: set of all quantitative items in property; 
  Output: OQI: set of optimized quantitative items; 

  (1) let OQI $\leftarrow$ empty set; qset $\leftarrow$ QI; 
      for any element q in qset do 
        if $i_1, i_2 \in q$ and they are not associated tolerant then 
          decompose q into two sub-quantitative items $q_1$ and $q_2$ such that 
          $q_1 \cup q_2 = q$, $i_1 \in q_1$ and $i_2 \in q_2$; 

  (2) for any two elements $q_1$ and $q_2$ in qset do 
        if $q_1$ and $q_2$ are not-good quantitative items then 
          compose $q_1$ and $q_2$ into a new quantitative item q such that 
          $q = q_1 \cup q_2$ and q is not a bad quantitative item; 

  (3) let OQI $\leftarrow$ all good quantitative items; 
      output OQI; 
end; 

Procedure DecComposeQI generates a set OQI of optimized quantitative 
items by decomposing “bad quantitative items” and composing (property or 
associated) tolerant quantitative items. 



Item Variables. According to our partitioning model, an attribute can be taken 
as an item variable. For the above item variables X and Y, X is the set of 
quantitative items Doctor, Master and UnderMaster of attribute Education, 
i.e., any element of X denotes a degree of education; and Y is the set of 
quantitative items $[3500, +\infty)$, $[2100, 3500)$ and $[0, 2100)$ of attribute Salary. 

An item variable is the generalization of some quantitative items with a 
property. For the previous examples, with $QI = \{q_1, q_2, q_3, q_4, q_5, q_6\}$, we 
take X and Y as two item variables with domains $R(X) = \{q_1, q_2, q_3\}$ and 
$R(Y) = \{q_4, q_5, q_6\}$, respectively. X is the generalization of $q_1$, $q_2$ and 
$q_3$, and Y is the generalization of $q_4$, $q_5$ and $q_6$. Then the sequence of 
the item variables X and Y is a partition over QI. 

As before, bad item variables should not be used to mine dependent rules, 
and not-bad item variables are also to be avoided unless they are required. 
We can obtain the decomposition of bad item variables and the composition of 
not-bad item variables as follows. 

Procedure 2 DecComposeIV 

begin 
  Input: OQI: set of all optimized quantitative items, IV: set of all item 
         variables in property; 
  Output: OIV: set of optimized item variables; 

  (1) for any element X in vset do 
        if $q_1, q_2 \in R(X)$ and they are not associated tolerant then 
          decompose X into two item variables $X_1$ and $X_2$ such that 
          $R(X_1) \cup R(X_2) = R(X)$, $q_1 \in R(X_1)$ and $q_2 \in R(X_2)$; 

  (2) for any two elements $X_1$ and $X_2$ in vset do 
        if $X_1$ and $X_2$ are not-good item variables then 
          if $X_1$ and $X_2$ are property tolerant and associated tolerant then 
            compose $X_1$ and $X_2$ into a new item variable X such that 
            $R(X) = R(X_1) \cup R(X_2)$ and X is not a bad item variable; 

  (3) let OIV $\leftarrow$ all optimized item variables; 
      output OIV; 
end; 

Procedure DecComposeIV generates a set OIV of optimized item variables 
by decomposing “bad item variables” and composing (property or 
associated) tolerant item variables. This procedure is similar to the procedure 
DecComposeQI. 

We now build the algorithm of the partitioning model as follows. Let D be a 
given database and I the set of all items in D. 

Procedure 3 PartitionData 

begin 
  Input: D: probabilistic database, I: set of all items in D; 
  Output: OQI: the set of quantitative items, OIV: the set of item variables; 

  (1) Generate relative properties, attributes, and constraint conditions 
      for D. 

  (2) Generate the set QI of all quantitative items from these relative 
      properties, attributes, and constraint conditions, such that all 
      quantitative items form a partition of I. 

  (3) Optimize all the quantitative items into OQI using Procedure 1. 

  (4) Generate the set IV of all item variables from these relative properties 
      and attributes. This means that each item variable can be viewed 
      as a set of some quantitative items with the same property (or attribute) 
      in some sense. 

  (5) Optimize all the item variables into OIV using Procedure 2. 
end; 

Procedure PartitionData generates a partition of the given database 
and obtains the set of optimized quantitative items OQI and the set of optimized 
item variables OIV. 



4 Identifying Dependent Rules 

4.1 Preprocess of Data 

Generally, probabilistic relations have deterministic keys. That is, each tuple 
represents a known real entity. The non-key attributes describe the properties of 
the entities and may be deterministic or stochastic in nature. Table 1 adopts 
a 1NF view of probabilistic relations. Its N1NF view is shown in Table 2. 



Table 2. A probabilistic relation 

EMP#   Education     salary, pS 
3025   Doctor        4100, 0.8 
                     2500, 0.1 
                     1800, 0.1 
6637   Doctor, 0.2   3500, 0.24 
       Master, 0.8   2400, 0.66 
                     1800, 0.1 
7741   Bachelor      3500, 0.1 
                     2400, 0.1 
                     1500, 0.8 








In Table 2, X = (1, 0, 0) and Y = (0.8, 0.1, 0.1) for EMP# = 3025, X = 
(0.2, 0.8, 0) and Y = (0.24, 0.66, 0.1) for EMP# = 6637, and X = (0, 0, 1) and 
Y = (0.1, 0.1, 0.8) for EMP# = 7741. 
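
As a minimal sketch, the vectors above can be recomputed from the 1NF tuples of Table 1 by summing pS over the tuples that share a key and a quantitative item. The helper names (`education_item`, `salary_item`, `marginals`) are hypothetical, and mapping Bachelor to UnderMaster is an assumption taken from the partition used in Example 1.

```python
from collections import defaultdict

# Table 1 as (EMP#, Education, salary, pS) tuples.
TABLE_1 = [
    (3025, "Doctor", 4100, 0.8), (3025, "Doctor", 2500, 0.1),
    (3025, "Doctor", 1800, 0.1), (6637, "Doctor", 3500, 0.14),
    (6637, "Doctor", 2400, 0.06), (6637, "Master", 3500, 0.1),
    (6637, "Master", 2400, 0.6), (6637, "Master", 1800, 0.1),
    (7741, "Bachelor", 3500, 0.1), (7741, "Bachelor", 2400, 0.1),
    (7741, "Bachelor", 1500, 0.8),
]

EDUCATION_ITEMS = ["Doctor", "Master", "UnderMaster"]
SALARY_ITEMS = ["[3500,+inf)", "[2100,3500)", "[0,2100)"]


def education_item(value):
    # Assumption: every degree below Master belongs to UnderMaster.
    return value if value in ("Doctor", "Master") else "UnderMaster"


def salary_item(value):
    if value >= 3500:
        return "[3500,+inf)"
    return "[2100,3500)" if value >= 2100 else "[0,2100)"


def marginals(rows):
    """Return {key: (X vector, Y vector)}, summing pS per key and item."""
    x = defaultdict(lambda: defaultdict(float))
    y = defaultdict(lambda: defaultdict(float))
    for key, education, salary, ps in rows:
        x[key][education_item(education)] += ps
        y[key][salary_item(salary)] += ps
    # Rounding only hides floating-point noise from the repeated additions.
    return {
        key: ([round(x[key][i], 10) for i in EDUCATION_ITEMS],
              [round(y[key][i], 10) for i in SALARY_ITEMS])
        for key in x
    }


for key, (x_vec, y_vec) in marginals(TABLE_1).items():
    print(key, x_vec, y_vec)
```

Each printed pair should agree with the X and Y vectors listed for the same EMP# above.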

Though N1NF models provide a framework for intuitively describing the 
nature of uncertain data, they pose the usual implementation problems 
associated with all N1NF relations. Much of the previous work on modeling 
probabilistic data is based on 1NF relations. So our work in this paper 
concentrates on the 1NF probabilistic relational model. 

The techniques for partitioning quantitative attributes are the 
same as in the above section. Without loss of generality, an attribute is taken 
as an item variable in this section. 

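For an item variable Z, a key value k, and an element a of R(Z), the probability 
of Z = a is obtained by summing pS over the tuples $\tau$ with that key and value: 

$$P(Z = a) = \sum_{\tau(K) = k \,\wedge\, \tau(Z) = a} \tau(pS) \qquad (1)$$ 

We now present the data preprocessing as a procedure. 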

Procedure 4 Generatedata 

begin 
  Input: D: probabilistic database, threshold values; 
  Output: PS: set of probabilities of interest; 

  (1) call Partition(D) procedure of partitioning quantitative attributes; 
      let IV $\leftarrow$ all item variables; 
      let PS $\leftarrow \emptyset$; 

  (2) let DS $\leftarrow \emptyset$; 
      for a subset X of set Z of IV beginfor 
        let Y $\leftarrow$ Z - X; 
        let DS $\leftarrow$ DS $\cup$ {X, Y}; 
        for each tuple $\tau$ in D beginfor1 
          for each key value $\tau(K) = k$ beginfor2 
            for each element a in R(X) do 
              let $p(X = a) \leftarrow \sum_{\tau(K) = k \wedge \tau(X) = a} \tau(pS)$; 
            for each element a in R(Y) do 
              let $p(Y = a) \leftarrow \sum_{\tau(K) = k \wedge \tau(Y) = a} \tau(pS)$; 
          endfor2 
          if |X| > 0 then 
            let DS $\leftarrow$ DS $\cup$ {p(X), p(Y)}; 
        endfor1 
        let PS $\leftarrow$ PS $\cup$ {DS}; 
      endfor 

  (3) output PS set of probability sets; 
  endall. 
end; 

Procedure Generatedata is to preprocess the data in a given probabilistic 
database so as to find all interesting data. 

For a given probabilistic database, the preprocessing of the database generates 
a set PS of sets DS of probabilities of item variables. 

4.2 Mining Probabilistic Dependencies 

In this subsection, we first present a method to calculate the conditional 
probability matrix $M_{Y|X}$ for a possible rule $X \rightarrow Y$. We then estimate 
the support of the rule. 

Let X and Y be two item variables of a given probabilistic database. Let 
$R(X) = \{x_1, x_2, \cdots, x_n\}$, $R(Y) = \{y_1, y_2, \cdots, y_m\}$, and let 
$DS = \{(a, b) \mid a \in S(X), b \in S(Y)\} \in PS$ be a set of k data generated 
by procedure Generatedata. To mine a rule of the form $X \rightarrow Y$, we need to 
determine the conditional probability matrix of Y given X, $M_{Y|X}$. According 
to the Bayesian rule, the influence of X on Y is given by 

$$P(Y = y_i \mid X = x) = \sum_{k=1}^{n} p(y_i \mid x_k) * p(x_k), \qquad (2)$$ 

where $x \in R(X)$, $i = 1, 2, \cdots, m$. 

In the following, $P(Y = y_i \mid X = x)$ is denoted by $b_i$ and $p(y_i \mid x_j)$ is 
denoted by $p_{ji}$, where $i = 1, 2, \cdots, m$, $j = 1, 2, \cdots, n$. Given 
$a = (p(x_1) = a_1, p(x_2) = a_2, \cdots, p(x_n) = a_n) \in S(X)$ as an observation, 
$b_i$ can be solved in (2) as 

$$b_i = \sum_{k=1}^{n} a_k p_{ki}, \qquad (3)$$ 

where $i = 1, 2, \cdots, m$. 
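
As a small worked instance of this relation (an illustration only, assuming the matrix $M_{Y|X}$ of Example 1 and the observation for EMP# 6637 from Table 2), the vector b is the row vector a multiplied by the matrix:

```latex
b = a \cdot M_{Y|X}
  = (0.2,\ 0.8,\ 0)
    \begin{bmatrix} 0.8 & 0.1 & 0.1 \\ 0.1 & 0.8 & 0.1 \\ 0.1 & 0.1 & 0.8 \end{bmatrix}
  = (0.16 + 0.08,\ 0.02 + 0.64,\ 0.02 + 0.08)
  = (0.24,\ 0.66,\ 0.10)
```

This agrees with the salary distribution Y = (0.24, 0.66, 0.1) recorded for EMP# 6637.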

Intuitively, there is a relation between the data a and b in formula (3) if (a, b) 
is an observation. The $p_{ji}$ are invariant, while $a_j$ and $b_i$ are variable 
factors. Thus, if these properties of (3) are used to learn $p_{ji}$ from 
applications, we can acquire as much probabilistic information as possible 
from the bounded resources. 

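Our goal in this paper is to find the probabilities $p_{ji}$ from probabilistic 
databases. So for DS, the following function is ideally expected to be satisfied 
by all elements of DS: 

$$f(p_{1i}, p_{2i}, \cdots, p_{ni}) = \sum_{j \in DS} \Big( \sum_{k} a_{jk} p_{ki} - b_{ji} \Big)^2,$$ 

and the value of $f(p_{1i}, p_{2i}, \cdots, p_{ni})$ must be the minimum. Here $a_{jk}$ 
and $b_{ji}$ denote the components of the jth observation in DS. The above formula 
can also be written as 

$$f(p_{1i}, p_{2i}, \cdots, p_{ni}) = \sum_{j} \Big( \sum_{k} a_{jk} p_{ki} - b_{ji} \Big)^2.$$ 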

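Using the principle of extreme values in mathematical analysis, we can find 
the minimum by taking the partial derivatives of $f(p_{1i}, p_{2i}, \cdots, p_{ni})$ with 
respect to the unknowns $p_{1i}, p_{2i}, \cdots, p_{ni}$ and setting these derivatives 
to 0. That is, 

$$\begin{cases}
\dfrac{\partial f}{\partial p_{1i}} = 2 \sum_j \big( \sum_k a_{jk} p_{ki} - b_{ji} \big) a_{j1} = 0 \\
\dfrac{\partial f}{\partial p_{2i}} = 2 \sum_j \big( \sum_k a_{jk} p_{ki} - b_{ji} \big) a_{j2} = 0 \\
\qquad \cdots \\
\dfrac{\partial f}{\partial p_{ni}} = 2 \sum_j \big( \sum_k a_{jk} p_{ki} - b_{ji} \big) a_{jn} = 0
\end{cases}$$ 

or, 

$$\begin{cases}
p_{1i} \sum_j (a_{j1})^2 + p_{2i} \sum_j (a_{j1} a_{j2}) + \cdots + p_{ni} \sum_j (a_{j1} a_{jn}) - \sum_j (a_{j1} b_{ji}) = 0 \\
p_{1i} \sum_j (a_{j2} a_{j1}) + p_{2i} \sum_j (a_{j2})^2 + \cdots + p_{ni} \sum_j (a_{j2} a_{jn}) - \sum_j (a_{j2} b_{ji}) = 0 \\
\qquad \cdots \\
p_{1i} \sum_j (a_{jn} a_{j1}) + p_{2i} \sum_j (a_{jn} a_{j2}) + \cdots + p_{ni} \sum_j (a_{jn})^2 - \sum_j (a_{jn} b_{ji}) = 0
\end{cases}$$ 

Let A be the coefficient matrix of this equation group in $p_{1i}, p_{2i}, \cdots, p_{ni}$. 
If $d = |A| \neq 0$, then this equation group has a unique solution, which is 

$$p_{1i} = \frac{d_1}{d}, \quad p_{2i} = \frac{d_2}{d}, \quad \cdots, \quad p_{ni} = \frac{d_n}{d},$$ 

where $d_l$ ($l = 1, 2, \cdots, n$) is the determinant of the matrix obtained from A 
by replacing its lth column with the constant column 
$\sum_j (a_{j1} b_{ji}), \sum_j (a_{j2} b_{ji}), \cdots, \sum_j (a_{jn} b_{ji})$, and 
$i = 1, 2, \cdots, m$. 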

In the above, $p_{ji}$ represents the probability of $Y = y_i$ under the condition 
$X = x_j$, $i = 1, 2, \cdots, m$; $j = 1, 2, \cdots, n$. To ensure that the conditional 
probabilities for each $x_j$ sum to 1, the results are normalized as 

$$p_{ji} := p_{ji} / (p_{j1} + p_{j2} + \cdots + p_{jm}),$$ 

where $i = 1, 2, \cdots, m$; $j = 1, 2, \cdots, n$. 

Another measurement of $X \rightarrow Y$ is its support. Because it is a probabilistic 
dependency rule, we define a metric to check the degree to which $M_{Y|X}$ fits 
the given fact set. For a fact $(a, b) \in DS$, with $a = (a_1, a_2, \cdots, a_n) \in S(X)$ 
and $b = (b_1, b_2, \cdots, b_m) \in S(Y)$, let 
$b' = (b'_1, b'_2, \cdots, b'_m) = a \cdot M_{Y|X}$. Then the fitting error is defined as 

$$error(b, b') = |b - b'| = \sum_{i=1}^{m} |b_i - b'_i|.$$ 

If $error(b, b')$ is less than or equal to some user specified maximum allowance 
error e, then the fact (a, b) supports the conditional probability matrix $M_{Y|X}$. 
Let N be the size of DS, the data set of interest, and M the number of data 
supporting $M_{Y|X}$ in DS. The support of $X \rightarrow Y$ is defined as 

$$support(X, Y) = M/N$$ 

If $support(X, Y) \geq minsupp$, $X \rightarrow Y$ with $M_{Y|X}$ can be extracted as a valid rule. 
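
The fitting error and support just defined can be sketched as follows. This is an illustration rather than the authors' implementation: the function names are hypothetical, the facts are the three observations read from Table 2, the matrix is the one shown in Example 1, and the threshold 0.01 is an arbitrary choice.

```python
def predict(a, matrix):
    """b' = a * M, with rows of M indexed by x_j and columns by y_i."""
    n_cols = len(matrix[0])
    return [sum(a_j * row[i] for a_j, row in zip(a, matrix)) for i in range(n_cols)]


def fitting_error(b, b_pred):
    """Sum of absolute differences between observed and predicted b."""
    return sum(abs(bi - pi) for bi, pi in zip(b, b_pred))


def support(facts, matrix, max_error):
    """Fraction of facts (a, b) whose fitting error is at most max_error."""
    supporting = sum(
        1 for a, b in facts if fitting_error(b, predict(a, matrix)) <= max_error
    )
    return supporting / len(facts)


# Matrix from Example 1 and the observations (X, Y) listed for Table 2.
M = [[0.8, 0.1, 0.1],
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8]]
facts = [
    ((1, 0, 0), (0.8, 0.1, 0.1)),
    ((0.2, 0.8, 0), (0.24, 0.66, 0.1)),
    ((0, 0, 1), (0.1, 0.1, 0.8)),
]

print(support(facts, M, max_error=0.01))  # 1.0
```

With these three facts the matrix reproduces every observation up to rounding, so the rule would pass any minsupp threshold up to 1.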



4.3 Algorithm 

We now design the algorithm of the above statistical model. 

Algorithm 1 statisticalm 

Input: D: probabilistic database, minsupp and e: threshold values; 
Output: $X \rightarrow Y$: dependent rule, $M_{Y|X}$: the conditional probability 
        matrix of Y given X; 

Begin 
  Let DS $\leftarrow$ a set of probabilities in D with respect to item variables 
      X and Y; 
  Calculate $M_{Y|X}$; 
  For $(a, b) \in DS$ do 
    Count M, the number of data supporting $M_{Y|X}$ in DS for e; 
  If $M/|DS| \geq minsupp$ then 
    Output $X \rightarrow Y$ with $M_{Y|X}$ and support(X, Y); 
End. 

Algorithm statisticalm generates dependent rules of the form $X \rightarrow Y$, 
each attached with a conditional probability matrix $M_{Y|X}$, from a given 
probabilistic database D. 

The above method can synthesize the probability meanings of all point values 
of a sample. We now illustrate the use of this algorithm with an example. 

Example 2. For a given probabilistic database, X and Y are two item variables. 
Let $R(X) = \{x_1, x_2\}$ and $R(Y) = \{y_1, y_2, y_3\}$. The following 22 data are 
generated by procedure Generatedata: 



Table 3. Probabilities of X and Y 

EMP   p(x1)   p(x2)   p(y1)   p(y2)   p(y3) 
01    1       0       0.5     0.3     0.2 
02    0       1       0.1     0.6     0.3 
03    0.9     0.1     0.46    0.33    0.21 
04    0.1     0.9     0.14    0.57    0.29 
05    0.8     0.2     0.42    0.36    0.22 
06    0.2     0.8     0.18    0.54    0.28 
07    0.7     0.3     0.38    0.39    0.23 
08    0.3     0.7     0.22    0.51    0.27 
09    0.6     0.4     0.34    0.42    0.24 
10    0.4     0.6     0.26    0.48    0.26 
11    0.5     0.5     0.3     0.45    0.25 
12    0.95    0.05    0.48    0.315   0.205 
13    0.05    0.95    0.12    0.585   0.295 
14    0.85    0.15    0.44    0.345   0.215 
15    0.15    0.85    0.16    0.555   0.285 
16    0.75    0.25    0.4     0.375   0.225 
17    0.25    0.75    0.2     0.525   0.275 
18    0.65    0.35    0.36    0.405   0.235 
19    0.35    0.65    0.24    0.495   0.265 
20    0.55    0.45    0.32    0.435   0.245 
21    0.855   0.145   0.3     0.4     0.4 
22    0.654   0.346   0.4     0.2     0.4 



Applying the above method to these data gives, for $y_1$, the equations 

$$8.131241\, p_{11} + 3.427759\, p_{21} = 4.3121$$ 
$$3.427759\, p_{11} + 7.013241\, p_{21} = 2.4079$$ 

So we have $p_{11} = 0.4856369$, $p_{21} = 0.1059786$. 

In the same way, we can obtain $p_{12} = 0.2793863$, $p_{22} = 0.595241$; 
$p_{13} = 0.4059914$, $p_{23} = 0.2949115$. 

In order to assure the probability significance level of the probabilities, 
the results are normalized: 

$$p_{11} = 0.4856369/(0.4856369 + 0.2793863 + 0.4059914) = 0.414715,$$ 

and $p_{12} = 0.238585$, $p_{13} = 0.3467$; $p_{21} = 0.10639$, $p_{22} = 0.597553$, 
$p_{23} = 0.296057$. That is, we acquire a conditional probability matrix 
$M_{Y|X}$ for the above rule as follows 



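$$M_{Y|X} = \begin{bmatrix} p_{11} & p_{12} & p_{13} \\ p_{21} & p_{22} & p_{23} \end{bmatrix} = \begin{bmatrix} 0.414715 & 0.238585 & 0.3467 \\ 0.10639 & 0.597553 & 0.296057 \end{bmatrix}$$ 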



If allowance error e is equal to or less than 0.3, then $X \rightarrow Y$ with conditional 
probability matrix $M_{Y|X}$ can be extracted as a valid probabilistic rule, whose 
support is 1. 
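
The equation group for $p_{11}$ and $p_{21}$ in Example 2 can be checked numerically. The sketch below is an illustration only: it builds the normal equations from the p(x1), p(x2) and p(y1) columns of Table 3 and solves the resulting 2 x 2 system by determinants, as described in Section 4.2. The function names are hypothetical.

```python
# (p(x1), p(x2), p(y1)) for the 22 facts of Table 3.
DATA = [
    (1, 0, 0.5), (0, 1, 0.1), (0.9, 0.1, 0.46), (0.1, 0.9, 0.14),
    (0.8, 0.2, 0.42), (0.2, 0.8, 0.18), (0.7, 0.3, 0.38), (0.3, 0.7, 0.22),
    (0.6, 0.4, 0.34), (0.4, 0.6, 0.26), (0.5, 0.5, 0.3), (0.95, 0.05, 0.48),
    (0.05, 0.95, 0.12), (0.85, 0.15, 0.44), (0.15, 0.85, 0.16),
    (0.75, 0.25, 0.4), (0.25, 0.75, 0.2), (0.65, 0.35, 0.36),
    (0.35, 0.65, 0.24), (0.55, 0.45, 0.32), (0.855, 0.145, 0.3),
    (0.654, 0.346, 0.4),
]


def normal_equations(data):
    """Coefficients and constants of the least-squares equation group."""
    s11 = sum(a1 * a1 for a1, _, _ in data)
    s12 = sum(a1 * a2 for a1, a2, _ in data)
    s22 = sum(a2 * a2 for _, a2, _ in data)
    c1 = sum(a1 * b for a1, _, b in data)
    c2 = sum(a2 * b for _, a2, b in data)
    return s11, s12, s22, c1, c2


def solve_by_determinants(s11, s12, s22, c1, c2):
    """Solve the symmetric 2 x 2 system by replacing columns of A."""
    d = s11 * s22 - s12 * s12
    if d == 0:
        raise ValueError("coefficient matrix is singular")
    d1 = c1 * s22 - s12 * c2   # first column replaced by the constants
    d2 = s11 * c2 - s12 * c1   # second column replaced by the constants
    return d1 / d, d2 / d


p11, p21 = solve_by_determinants(*normal_equations(DATA))
print(round(p11, 4), round(p21, 4))  # approximately 0.4856 and 0.106
```

The same two functions apply to the other columns of $M_{Y|X}$ once the p(y1) values in `DATA` are replaced by p(y2) or p(y3).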



5 Experiments 

To study the effectiveness of our model, we have performed several experiments. 
Our server is Oracle 8.0.3, and the algorithm is implemented on a Sun SparcServer 
using Java, with the JDBC API as the interface between the program and 
Oracle. 

To evaluate our model, we have used one kind of dataset: relational 
probabilistic databases, which are randomly generated according to the probabilistic 
data in the probabilistic databases. Our experimental results demonstrate that 
the approach is efficient and promising on this kind of dataset. 

The main properties of the data sets are the following. The numbers (attriN) of 
attributes in the datasets range from 5 to 15. The numbers (dsetN) of data of 
interest are approximately 100, 1000, 10000, and 100000. They are listed in Table 4. 



Table 4. Synthetic data set characteristics 

Data set name   attriN   dsetN 
A5.D100K        5        100 
A5.D100K        5        10000 
A10.D100K       10       100 
A10.D1000K      10       1000 
A10.D10000K     10       10000 
A15.D100000K    15       100000 



To improve the statistical model statisticalm (Algorithm 1), we have also 
designed two algorithms: randommodel (a random search model that 
runs on a set of instances, namely random samples) and partitionm (which first 
partitions the data set into several subsets and then applies the random search 
model to these subsets). The experiments are designed to test the effectiveness 
of our designed algorithms. For the group of data in Table 4, the performance 
of the proposed algorithms is depicted in Figure 1. 




Fig. 1. The running time 



In Figure 1, algorithm partitionm presents the best performance. 

6 Conclusion 

Today's database systems must handle uncertainties in the data they store, so 
mining probabilistic databases is necessary for applications. To our knowledge, 
there is no previous work on probabilistic database mining. Research on association 
analysis such as [1] and [7] is closely related to this work. However, previous 
mining techniques cannot work well on probabilistic databases due to the fact that 
probabilistic databases are in 1NF. In this paper we proposed a new model, based 
on partitioning, for discovering useful dependent rules in probabilistic databases. 
We evaluated the proposed technique, and our experimental results demonstrate 
that the approach is efficient and promising. 

Abstract. In the context of logic program updates, a knowledge base, 
which is presented as a logic program, can be updated in terms of 
another logic program, i.e. a set of update rules. In this paper, we investigate 
the complexity of logic program updates where conflict resolution on 
defeasible information is explicitly taken into account in an update. We 
show that in general the problem of model checking in logic program 
updates is co-NP-complete, and the corresponding inference problem is 
$\Pi_2^P$-complete. We also characterize particular classes of update 
specifications where the inference problem has a lower computational complexity. 
These results confirm that logic program update, even with the issue 
of conflict resolution on defeasible information being present, is not 
harder than the principal update tasks. 



1 Introduction 

In the context of logic program updates, a knowledge base, which is presented as 
a logic program, can be updated in terms of another logic program, i.e. a set of 
update rules. While the semantics and properties of logic program update have 
been studied by many researchers recently, e.g. [1,9], its computational com- 
plexity still remains unclear when conflict resolution on defeasible information 
is taken into account in logic program updates. In this paper, we investigate the 
complexity problem of logic program updates where conflict resolution on de- 
feasible information is explicitly taken into account in our update problems. We 
show that in general the problem of model checking in logic program updates 
is co-NP-complete, and the corresponding inference problem is $\Pi_2^P$-complete. 
We also characterize particular classes of update specifications where the 
inference problem has a lower computational complexity. These results confirm that 
logic program update, even with the issue of conflict resolution on defeasible 
information being present, is not harder than the principal update tasks. 

The paper is organized as follows. In section 2 we briefly review the priori- 
tized logic program which will be used as a basis for our logic program update 
formulation. In section 3 we develop a logic program update framework in which 
both contradiction elimination and conflict resolution are taken into account. 
In section 4 we analyze the computational complexity of logic program updates 
in detail, while in section 5 we investigate under what conditions the inference 
problem in an update can be simplified. Finally, in section 6 we conclude this 
paper with some remarks. 



M. Brooks, D. Corbett, and M. Stumptner (Eds.): AI 2001, LNAI 2256, pp. 631-642, 2001. 
(c) Springer-Verlag Berlin Heidelberg 2001 

2 Prioritized Logic Programs: A Review 

In this section we briefly review prioritized logic programs (PLPs) proposed by 
Zhang & Foo [10]. To specify PLPs, we first introduce the extended logic program 
and its answer set semantics developed by Gelfond and Lifschitz [7]. A language 
$\mathcal{L}$ of extended logic programs is determined by its object constants, function 
constants and predicate constants. Terms are built as in the corresponding first 
order language; atoms have the form $P(t_1, \cdots, t_n)$, where $t_i$ ($1 \leq i \leq n$) 
is a term and P is a predicate constant of arity n; a literal is either an atom 
$P(t_1, \cdots, t_n)$ or a negative atom $\neg P(t_1, \cdots, t_n)$. A rule is an 
expression of the form: 

$$L_0 \leftarrow L_1, \cdots, L_m, \; not\ L_{m+1}, \cdots, \; not\ L_n, \qquad (1)$$ 

where each $L_i$ ($0 \leq i \leq n$) is a literal. $L_0$ is called the head of the rule, 
while $L_1, \cdots, L_m, not\ L_{m+1}, \cdots, not\ L_n$ is called the body of the rule. 
Obviously, the body of a rule could be empty. A term, atom, literal, or rule is 
ground if no variable occurs in it. An extended logic program $\Pi$ is a collection 
of rules. 

To evaluate an extended logic program, Gelfond and Lifschitz proposed 
answer set semantics for extended logic programs. Let $\Pi$ be an extended logic 
program not containing not and Lit the set of all ground literals in the language 
of $\Pi$. The answer set of $\Pi$, denoted as $Ans(\Pi)$, is the smallest subset S of Lit 
such that (i) for any rule $L_0 \leftarrow L_1, \cdots, L_m$ from $\Pi$, if 
$L_1, \cdots, L_m \in S$, then $L_0 \in S$, and (ii) if S contains a pair of 
complementary literals, then S = Lit. 
Now let $\Pi$ be an arbitrary extended logic program. For any subset S of Lit, let 
$\Pi^S$ be the logic program obtained from $\Pi$ by deleting (i) each rule that has a 
formula not L in its body with $L \in S$, and (ii) all formulas of the form not L in 
the bodies of the remaining rules¹. We define that S is an answer set of $\Pi$ iff S 
is an answer set of $\Pi^S$. An extended logic program $\Pi$ is well defined if it has a 
consistent answer set. 
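
The two-step definition above (reduce the program with respect to a candidate set S, then compare S with the smallest closed set of the reduced program) can be sketched as follows. This is a brute-force illustration for small ground programs, not an algorithm from the paper. The encoding is an assumption: a literal is a string, classical negation is written with a leading "-", and a rule is a triple (head, positive body, literals under not). The example program is the ground Tweety program used later in this section, with the priority relation ignored.

```python
from itertools import combinations


def negate(literal):
    """Complement of a literal; '-' marks classical negation (an assumption)."""
    return literal[1:] if literal.startswith("-") else "-" + literal


def closure(positive_rules, all_literals):
    """Answer set of a program without 'not': smallest set closed under the rules."""
    derived = set()
    changed = True
    while changed:
        changed = False
        for head, body in positive_rules:
            if head not in derived and all(l in derived for l in body):
                derived.add(head)
                changed = True
    # A complementary pair makes the answer set the whole of Lit.
    if any(negate(l) in derived for l in derived):
        return set(all_literals)
    return derived


def reduct(rules, candidate):
    """Drop rules blocked by the candidate, then drop the remaining 'not' parts."""
    return [(head, pos) for head, pos, naf in rules if not set(naf) & candidate]


def is_answer_set(rules, candidate, all_literals):
    return closure(reduct(rules, candidate), all_literals) == set(candidate)


def answer_sets(rules, all_literals):
    """Enumerate every subset of Lit and keep the answer sets (exponential)."""
    literals = sorted(all_literals)
    found = []
    for size in range(len(literals) + 1):
        for subset in combinations(literals, size):
            if is_answer_set(rules, set(subset), all_literals):
                found.append(set(subset))
    return found


RULES = [
    ("Fly", ["Bird"], ["-Fly"]),     # Fly <- Bird, not -Fly
    ("-Fly", ["Penguin"], ["Fly"]),  # -Fly <- Penguin, not Fly
    ("Bird", [], []),
    ("Penguin", [], []),
]
LITERALS = {"Fly", "-Fly", "Bird", "-Bird", "Penguin", "-Penguin"}

for s in answer_sets(RULES, LITERALS):
    print(sorted(s))
```

Without priorities the program has two answer sets, one containing Fly and one containing -Fly, which is the ambiguity that the ordering on rule names is meant to resolve.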

The language of PLPs is a language $\mathcal{L}$ of extended logic programs with
the following augments:

- Names: $N, N_1, N_2, \cdots$.

- A strict partial ordering $<$ on names.

- A naming function $\mathcal{N}$, which maps a rule to a name.

A PLP $\mathcal{P}$ is a triple $(\Pi, \mathcal{N}, <)$, where $\Pi$ is an extended logic program, $\mathcal{N}$ is a
naming function mapping each rule in $\Pi$ to a name, and $<$ is a strict partial
ordering on names. The partial ordering $<$ in $\mathcal{P}$ plays an essential role in the
evaluation of $\mathcal{P}$. We also use $\mathcal{P}(<)$ to denote the set of $<$-relations of $\mathcal{P}$.
Intuitively, $<$ represents a preference of applying rules during the evaluation of the
program. In particular, if $\mathcal{N}(r) < \mathcal{N}(r')$ holds in $\mathcal{P}$, rule $r$ would be preferred to
apply over rule $r'$ during the evaluation of $\mathcal{P}$ (i.e. rule $r$ is more preferred than
rule $r'$). Consider the following classical example represented in our formalism:



$\mathcal{P}_1$:

$N_1$: $Fly(x) \leftarrow Bird(x), not\ \neg Fly(x)$,

$N_2$: $\neg Fly(x) \leftarrow Penguin(x), not\ Fly(x)$,

$N_3$: $Bird(Tweety) \leftarrow$,

$N_4$: $Penguin(Tweety) \leftarrow$,

$N_2 < N_1$.

^1 We also call $\Pi^S$ the Gelfond-Lifschitz transformation of $\Pi$ in terms of $S$.

Obviously, rules $N_1$ and $N_2$ conflict with each other as their heads are
complementary literals, and applying $N_1$ will defeat $N_2$ and vice versa. However, as
$N_2 < N_1$, we would expect that rule $N_2$ is preferred to apply first and then
defeat rule $N_1$ so that the desired solution $\neg Fly(Tweety)$ can be derived. In a
PLP or an extended logic program, we usually view a rule including variables
to be the set of all ground instances of this rule formed from the set of ground
literals in the language.

Definition 1. Let $\Pi$ be a ground extended logic program and $r$ a rule with the
form $L_0 \leftarrow L_1, \cdots, L_m, not\ L_{m+1}, \cdots, not\ L_n$ ($r$ does not necessarily belong to
$\Pi$). Rule $r$ is defeated by $\Pi$ iff $\Pi$ has an answer set and for any answer set $S$
of $\Pi$, there exists some $L_i \in S$, where $m + 1 \le i \le n$.
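
As an illustration of Definition 1, the sketch below tests defeat on the ground Tweety rules. It is a brute-force check under assumptions of our own: literals are strings with a leading "-" for classical negation, rules are hypothetical (head, positive body, not-body) triples, and the answer sets are enumerated exhaustively, so it is only meant for very small programs.

```python
from itertools import combinations

def complement(lit):
    return lit[1:] if lit.startswith("-") else "-" + lit

def answer_sets(program):
    """Brute-force answer sets of a ground program of (head, pos, neg) rules."""
    lit = sorted({x for h, pos, neg in program
                  for l in (h, *pos, *neg) for x in (l, complement(l))})
    found = []
    for k in range(len(lit) + 1):
        for cand in map(set, combinations(lit, k)):
            # Gelfond-Lifschitz transformation with respect to the candidate
            reduct = [(h, pos) for h, pos, neg in program if not set(neg) & cand]
            s, changed = set(), True
            while changed:
                changed = False
                for h, pos in reduct:
                    if h not in s and set(pos) <= s:
                        s.add(h)
                        changed = True
            if any(complement(l) in s for l in s):
                s = set(lit)  # inconsistent closure collapses to Lit
            if s == cand:
                found.append(cand)
    return found

def is_defeated(rule, program):
    """Definition 1: the program has an answer set, and every answer set
    contains some literal from the 'not' part of the rule body."""
    _, _, neg = rule
    sets = answer_sets(program)
    return bool(sets) and all(set(neg) & s for s in sets)

# Ground instances of the Tweety rules N1..N4
n1 = ("Fly", ("Bird",), ("-Fly",))
n2 = ("-Fly", ("Penguin",), ("Fly",))
n3 = ("Bird", (), ())
n4 = ("Penguin", (), ())
```

Here `is_defeated(n1, [n2, n3, n4])` and `is_defeated(n2, [n1, n3, n4])` both hold, which is exactly the symmetric conflict that the ordering $N_2 < N_1$ is meant to break.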

Sometimes, it is also convenient to say that a set $S$ of ground literals defeats
a rule $r$ if there is some literal $L$ in $S$ and $r$ has a form $L_0 \leftarrow \cdots, not\ L, \cdots$. Now
our idea of evaluating a PLP is as follows. Let $\mathcal{P} = (\Pi, \mathcal{N}, <)$. If there are two
rules $r$ and $r'$ in $\Pi$ and $\mathcal{N}(r) < \mathcal{N}(r')$, $r'$ will be ignored in the evaluation of $\mathcal{P}$
only if keeping $r$ in $\Pi$ and deleting $r'$ from $\Pi$ will result in a defeat of $r'$, i.e.
$r'$ is defeated by $\Pi - \{r'\}$. By eliminating all such potential rules from $\Pi$, $\mathcal{P}$ is
eventually reduced to an extended logic program in which the partial ordering $<$
has been removed. Our evaluation for $\mathcal{P}$ is then based on this reduced extended
logic program.

The evaluation of a PLP will be based on its ground form. That is, for any
PLP $\mathcal{P} = (\Pi, \mathcal{N}, <)$, we consider its ground instantiation $\mathcal{P}' = (\Pi', \mathcal{N}', <')$,
where $\Pi'$, $\mathcal{N}'$ and $<'$ are ground instantiations of $\Pi$, $\mathcal{N}$ and $<$ respectively^2.
However, to ensure that ordering $<'$ is well behaved, i.e. $<'$ is also a strict
partial ordering and every non-empty subset of $\Pi'$ has a least element with
respect to $<'$, we require that $\mathcal{P} = (\Pi, \mathcal{N}, <)$ be well formed: there does not
exist a rule $r'$ in $\Pi'$ that is an instance of two different rules $r_1$ and $r_2$ in $\Pi$ and
$\mathcal{N}(r_1) < \mathcal{N}(r_2) \in \mathcal{P}(<)$.

Definition 2. Let $\mathcal{P} = (\Pi, \mathcal{N}, <)$ be a ground prioritized extended logic
program. $\mathcal{P}^<$ is a reduct of $\mathcal{P}$ with respect to $<$ if and only if there exists a sequence
of sets $\Pi_i$ ($i = 0, 1, \cdots$) such that:

1. $\Pi_0 = \Pi$;

2. $\Pi_i = \Pi_{i-1} - \{r_1, r_2, \cdots \mid$ (a) there exists $r \in \Pi_{i-1}$ such that for every $j$
($j = 1, 2, \cdots$), $\mathcal{N}(r) < \mathcal{N}(r_j) \in \mathcal{P}(<)$ and $r_1, r_2, \cdots$ are defeated by
$\Pi_{i-1} - \{r_1, r_2, \cdots\}$, and (b) there does not exist a rule $r' \in \Pi_{i-1}$ such that
$\mathcal{N}(r_j) < \mathcal{N}(r')$ for some $j$ ($j = 1, 2, \cdots$) and $r'$ is defeated by $\Pi_{i-1} - \{r'\}$ $\}$;

3. $\mathcal{P}^< = \bigcap_{i=0}^{\infty} \Pi_i$.

^2 Note that if $\mathcal{P}'$ is a ground instantiation of $\mathcal{P}$, then $\mathcal{N}(r_1) < \mathcal{N}(r_2) \in \mathcal{P}(<)$ implies
$\mathcal{N}'(r_1') <' \mathcal{N}'(r_2') \in \mathcal{P}'(<')$, where $r_1'$ and $r_2'$ are ground instances of $r_1$ and $r_2$
respectively.

Definition 3. Let $\mathcal{P} = (\Pi, \mathcal{N}, <)$ be a PLP and $Lit$ the set of all ground literals
in the language of $\mathcal{P}$. For any subset $S$ of $Lit$, $S$ is an answer set of $\mathcal{P}$ iff $S$ is
an answer set of some reduct $\mathcal{P}^<$ of $\mathcal{P}$. A ground literal $L$ is entailed from $\mathcal{P}$,
denoted as $\mathcal{P} \models L$, if $L$ belongs to every answer set of $\mathcal{P}$. $\mathcal{P}$ is called well defined
if it has a consistent answer set.

Using Definitions 2 and 3, it is easy to conclude that $\mathcal{P}_1$ has a unique reduct
as follows:

$\mathcal{P}_1^< = \{\neg Fly(x) \leftarrow Penguin(x), not\ Fly(x),\ Bird(Tweety) \leftarrow,\ Penguin(Tweety) \leftarrow\}$,

from which we obtain the following answer set of $\mathcal{P}_1$:

$\{Bird(Tweety), Penguin(Tweety), \neg Fly(Tweety)\}$.

3 Logic Program Updates 

Given a knowledge base which is represented as a logic program (i.e. a finite
set of rules), we consider the problem of how to update this knowledge base
in terms of another logic program (i.e. a set of update rules). In our context,
since both knowledge base $\Pi_0$ and the set of update rules $\Pi_1$ are expressed as
extended logic programs where rules in $\Pi_0$ or $\Pi_1$ may contain both classical
negation and negation as failure, there are two essential issues to achieve this
kind of logic program update: eliminating contradictory rules between $\Pi_0$ and
$\Pi_1$ and solving conflicts among rules in $\Pi_0$ and $\Pi_1$. That is, if a rule $r$ in $\Pi_0$
contradicts some other rule(s) in $\Pi_1$, we should remove $r$ from $\Pi_0$. But we
must be aware that removing $r$ from $\Pi_0$ may have effects on other rules. On the
other hand, if a defeasible rule $r$ in $\Pi_0$ conflicts with another rule $r'$ in $\Pi_1$, we
should have some way to solve this conflict. These ideas are elaborated as follows.

Eliminating contradictory rules 

To eliminate contradictory rules from $\Pi_0$ with respect to $\Pi_1$, it is not
sufficient simply to extract a maximal subset $\Pi$ of $\Pi_0$ by requiring $\Pi \cup \Pi_1$ to be well
defined. For instance, suppose $\Pi_0 = \{P \leftarrow, R \leftarrow Q\}$ and $\Pi_1 = \{Q \leftarrow P, \neg R \leftarrow Q\}$.
Clearly, both $\Pi = \{P \leftarrow\}$ and $\Pi' = \{R \leftarrow Q\}$ are maximal subsets of $\Pi_0$
such that $\Pi \cup \Pi_1$ and $\Pi' \cup \Pi_1$ are well defined. But intuitively, rule $R \leftarrow Q$
represents a contradictory semantics compared with rule $\neg R \leftarrow Q$ in $\Pi_1$, and
hence we would like to delete $R \leftarrow Q$ instead of deleting $P \leftarrow$ from $\Pi_0$.

To achieve this purpose, we first update each answer set $S$ of $\Pi_0$ with
$\Pi_1$, which we call simple fact update. The result of this update is a set of
ground literals, denoted as $S'$, which has minimal difference from $S$ and
satisfies each rule in $\Pi_1$. If $S'$ is consistent, we then extract a maximal subset
$\Pi_{(\Pi_0,\Pi_1)} \subseteq \Pi_0$ such that $S'$ is coherent with $\Pi_{(\Pi_0,\Pi_1)} \cup \Pi_1$, i.e. $S'$ is a
subset of an answer set of $\Pi_{(\Pi_0,\Pi_1)} \cup \Pi_1$. Doing this, $\Pi_{(\Pi_0,\Pi_1)}$ is guaranteed
to maximally retain rules of $\Pi_0$ which are not contradictory to rules of
$\Pi_1$. The program $\Pi_{(\Pi_0,\Pi_1)}$ is called a transformed program from $\Pi_0$ with
respect to $\Pi_1$. If no consistent $S'$ exists, on the other hand, $\Pi_{(\Pi_0,\Pi_1)}$ is simply
specified to be any maximal subset of $\Pi_0$ such that $\Pi_{(\Pi_0,\Pi_1)} \cup \Pi_1$ is well defined.

Solving conflicts 

After transforming $\Pi_0$ to $\Pi_{(\Pi_0,\Pi_1)}$, we need to solve possible conflicts
between rules in $\Pi_{(\Pi_0,\Pi_1)}$ and $\Pi_1$. We call this phase program update. To do so,
we specify a prioritized logic program $\mathcal{P}_{(\Pi_0,\Pi_1)} = (\Pi_{(\Pi_0,\Pi_1)} \cup \Pi_1, \mathcal{N}, <)$, where
for each rule $r$ in $\Pi_1$ and each rule $r'$ in $\Pi_{(\Pi_0,\Pi_1)}$, we specify $\mathcal{N}(r) < \mathcal{N}(r')$. The
intuitive idea behind this is that rules in $\Pi_1$ should be more preferred than rules
in $\Pi_{(\Pi_0,\Pi_1)}$, as $\Pi_1$ expresses the agent's latest knowledge. Whenever there is a
conflict between rules $r$ and $r'$ where $r \in \Pi_1$ and $r' \in \Pi_{(\Pi_0,\Pi_1)}$ respectively, $r$
will override $r'$. Then, we finally specify the possible resulting program $\Pi_0'$ after
updating $\Pi_0$ with $\Pi_1$ to be a reduct of $\mathcal{P}_{(\Pi_0,\Pi_1)}$, i.e.
$\mathcal{P}^<_{(\Pi_0,\Pi_1)}$ (see Definition 2).

Now we give the formal definition for simple fact update. Let $\Pi_0$ and $\Pi_1$ be
two extended logic programs, and $S$ an answer set of $\Pi_0$. We specify $\mathcal{L}_{new}$ to be
a language of PLPs based on $\Pi_0$ and $\Pi_1$'s language $\mathcal{L}$ with one more augment:
For each predicate symbol $P$ in $\mathcal{L}$, there is a corresponding predicate symbol
$New\text{-}P$ in $\mathcal{L}_{new}$ with the same arity as $P$. To simplify our presentation, in
$\mathcal{L}_{new}$ we use notation $New\text{-}L$ to denote the corresponding literal $L$ in $\mathcal{L}$. We
use $Lit_{new}$ to denote the set of all ground literals of $\mathcal{L}_{new}$. Clearly, $Lit_{new} =
Lit \cup \{New\text{-}L \mid L \in Lit\}$.

Definition 4. Let $S$ be a consistent set of ground literals and $\Pi$ an extended
logic program. The specification of updating $S$ with $\Pi$ is defined as a PLP of
$\mathcal{L}_{new}$, denoted as $Update(S, \Pi) = (\Pi^*, \mathcal{N}, <)$, as follows:

1. $\Pi^*$ consists of the following rules:

Initial fact rules: for each $L$ in $S$, there is a rule $L \leftarrow$.

Inertia rules: for each predicate symbol $P$ in $\mathcal{L}$, there are two rules:

$New\text{-}P(x) \leftarrow P(x), not\ \neg New\text{-}P(x)$, and

$\neg New\text{-}P(x) \leftarrow \neg P(x), not\ New\text{-}P(x)$.

Update rules: for each rule $L_0 \leftarrow L_1, \cdots, L_m, not\ L_{m+1}, \cdots, not\ L_n$ in $\Pi$,
there is a rule $New\text{-}L_0 \leftarrow New\text{-}L_1, \cdots, New\text{-}L_m, not\ New\text{-}L_{m+1}, \cdots, not\ New\text{-}L_n$.

2. For any inertia rule $r$ and update rule $r'$, $\mathcal{N}(r) < \mathcal{N}(r')$.

Definition 5. (Simple Fact Update) A set of ground literals of $\mathcal{L}$, $S'$, is
called a possible result with respect to the update specification $Update(S, \Pi)$, iff
for some answer set $S^*$ of $Update(S, \Pi)$, $S' = \{L \mid New\text{-}L \in S^*\}$.

The update specification $Update(S, \Pi)$ defined in Definition 4 provides a
formal method to derive the possible result of updating $S$ with $\Pi$. In particular,
inertia rules in $\Pi^*$ guarantee a minimal change during the update, i.e. any
initial fact in $S$ that is not explicitly changed will persist by default. Update rules,
on the other hand, specify effects of the update. Since both inertia and update
rules may be defeasible^3, possible conflicts may occur between them.
Furthermore, from the minimal change principle, initial facts in $Ans(\Pi_0)$ are always
preferred to persist during the update whenever there is no explicit violation of
update rules. So we specify that inertia rules are more preferred than update
rules. Finally, the possible result of this update is derived from answer sets of
$Update(S, \Pi)$ as presented in Definition 5.
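
To make the shape of the update specification concrete, the following sketch builds the rule set and the priority pairs of Definition 4 for the propositional case. It only constructs the specification and does not evaluate it. The encoding is an assumption of this sketch: literals are strings with a leading "-" for classical negation, rules are (head, positive body, not-body) triples, and the rule names (`init_0`, `inertia_pos_P`, `update_0`, and so on) are hypothetical labels standing in for the naming function.

```python
def new(lit):
    """Map a literal L to New-L, keeping classical negation in front."""
    return "-New-" + lit[1:] if lit.startswith("-") else "New-" + lit

def atom(lit):
    """Strip classical negation from a literal."""
    return lit[1:] if lit.startswith("-") else lit

def update_specification(s, program):
    """Build (rules, priorities) for Update(S, Pi) in the propositional case.

    Each rule is (name, head, pos, neg); each priority (a, b) reads N(a) < N(b).
    """
    atoms = sorted({atom(l) for l in s} |
                   {atom(l) for h, pos, neg in program for l in (h, *pos, *neg)})
    # Initial fact rules: L <- for each L in S
    initial = [(f"init_{i}", l, (), ()) for i, l in enumerate(sorted(s))]
    # Inertia rules: two per propositional symbol
    inertia = []
    for a in atoms:
        inertia.append((f"inertia_pos_{a}", new(a), (a,), (new("-" + a),)))
        inertia.append((f"inertia_neg_{a}", new("-" + a), ("-" + a,), (new(a),)))
    # Update rules: every literal of every rule in Pi is renamed to New-L
    update = [(f"update_{i}", new(h), tuple(map(new, pos)), tuple(map(new, neg)))
              for i, (h, pos, neg) in enumerate(program)]
    # Every inertia rule is preferred over every update rule
    priorities = [(r[0], u[0]) for r in inertia for u in update]
    return initial + inertia + update, priorities

# Updating S = {P, R} with the program {-P <-}
rules, priorities = update_specification({"P", "R"}, [("-P", (), ())])
```

For this input the specification has two initial fact rules, four inertia rules and one update rule, with each inertia rule ordered before the update rule.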

Lemma 1. Let $Update(S, \Pi)$ be a well defined update specification as specified
in Definition 4. $S'$ is a possible result with respect to $Update(S, \Pi)$ if and only
if $S'$ is an answer set of PLP $\mathcal{P} = (\Pi \cup \{L \leftarrow not\ \bar{L} \mid L \in S\}, \mathcal{N}, <)$, where for
each rule $r: L \leftarrow not\ \bar{L}$ with $L \in S$, and each rule $r'$ in $\Pi$, $\mathcal{N}(r) < \mathcal{N}(r')$^4.

Definition 6. Let $\Pi_0$ and $\Pi_1$ be two extended logic programs, $S$ an answer set
of $\Pi_0$ and $Update(S, \Pi_1)$ as specified in Definition 4. A subset $\Pi_{(\Pi_0,\Pi_1)}$ of $\Pi_0$ is
called a transformed program from $\Pi_0$ with respect to $\Pi_1$, iff (1) if $Update(S, \Pi_1)$
has a consistent answer set $S'$, then $\Pi_{(\Pi_0,\Pi_1)}$ is a maximal subset of $\Pi_0$ such
that $S'$ is coherent with $\Pi_{(\Pi_0,\Pi_1)} \cup \Pi_1$; (2) if $Update(S, \Pi_1)$ has no consistent
answer set, then $\Pi_{(\Pi_0,\Pi_1)}$ is any maximal subset of $\Pi_0$ such that $\Pi_{(\Pi_0,\Pi_1)} \cup \Pi_1$
is well defined.

Definition 7. (Program Update) Let $\Pi_{(\Pi_0,\Pi_1)}$ be defined as in Definition
6. The specification of updating $\Pi_0$ with $\Pi_1$, denoted as $P\text{-}Update(\Pi_0, \Pi_1)$, is
a PLP $(\Pi_{(\Pi_0,\Pi_1)} \cup \Pi_1, \mathcal{N}, <)$, where for each rule $r$ in $\Pi_1$ and each rule $r'$ in
$\Pi_{(\Pi_0,\Pi_1)}$, $\mathcal{N}(r) < \mathcal{N}(r')$. $\Pi_0'$ is called a possible resulting program after updating
$\Pi_0$ with $\Pi_1$ iff $\Pi_0'$ is a reduct of $\mathcal{P}_{(\Pi_0,\Pi_1)}$.

Example 1. Given two extended logic programs $\Pi_0 = \{P \leftarrow, R \leftarrow not\ Q, Q \leftarrow
not\ R\}$ and $\Pi_1 = \{\neg P \leftarrow\}$. Consider an update of $\Pi_0$ with $\Pi_1$. Firstly, to find
out the contradictory rule in $\Pi_0$ with respect to $\Pi_1$, we update every answer set
of $\Pi_0$ with $\Pi_1$. Clearly, $\Pi_0$ has two answer sets $\{P, R\}$ and $\{P, Q\}$. Updating
these two answer sets with $\Pi_1$, we get $\{\neg P, R\}$ and $\{\neg P, Q\}$ respectively. From
Definition 6, it is not difficult to conclude that program $\Pi_{(\Pi_0,\Pi_1)} = \{R \leftarrow not\ Q,
Q \leftarrow not\ R\}$ is the unique transformed program from $\Pi_0$ with respect to
$\Pi_1$ (i.e. both $\{\neg P, R\}$ and $\{\neg P, Q\}$ are coherent with $\Pi_{(\Pi_0,\Pi_1)} \cup \Pi_1$). Secondly,
to solve possible conflicts between rules in $\Pi_{(\Pi_0,\Pi_1)}$ and $\Pi_1$, we specify a PLP
$\mathcal{P}_{(\Pi_0,\Pi_1)} = (\Pi_{(\Pi_0,\Pi_1)} \cup \Pi_1, \mathcal{N}, <)$ as follows:

$N_1$: $R \leftarrow not\ Q$,

$N_2$: $Q \leftarrow not\ R$,

$N_3$: $\neg P \leftarrow$,

$N_3 < N_1$, $N_3 < N_2$.

Finally, it is concluded from Definition 2 that $\mathcal{P}_{(\Pi_0,\Pi_1)}$ has two reducts $\{\neg P \leftarrow,
R \leftarrow not\ Q\}$ and $\{\neg P \leftarrow, Q \leftarrow not\ R\}$, which, as specified in Definition 7, are
the two possible resulting programs after updating $\Pi_0$ with $\Pi_1$. □

^3 Weak negation not may be included in these rules.
^4 $\bar{L}$ stands for the complement of literal $L$.

4 Complexity of Logic Program Updates 

In this section, we address the issue of computational complexity of logic program 
updates. From previous presentations, it is clear that our update consists of two 
steps: the simple fact update and program update, where the former is used to 
remove contradictory rules from the knowledge base and provides a basis for the 
program update. It is also easy to see that a simple fact update is a special case 
of program update^5. For this reason, our complexity analysis will be based on
these two phases. In the rest of this section, we assume that all programs are 
finite propositional programs. 

We first introduce necessary notions of complexity theory; further
descriptions can be found in [5]. The class $P^C$ consists of the problems solvable
by a polynomial-time deterministic Turing machine with an oracle for a problem
from $C$, where the class $NP^C$ includes the problems solvable by a nondeterministic
polynomial-time Turing machine with an oracle for a problem in $C$. Let $C$ be a class of
decision problems; by co-$C$ we mean the class consisting of the complements of
the problems in $C$.

The classes $\Sigma_k^P$ and $\Pi_k^P$ of the polynomial hierarchy are defined as follows:
$\Sigma_0^P = \Pi_0^P = P$ and

$\Sigma_k^P = NP^{\Sigma_{k-1}^P}$, $\Pi_k^P = \text{co-}\Sigma_k^P$ for all $k \ge 1$.

It is easy to see that $NP = \Sigma_1^P$, co-$NP = \Pi_1^P$, and $\Sigma_2^P = NP^{NP}$. It is also
observed that in general $P \subseteq NP$, $P \subseteq$ co-$NP$, $NP \subseteq \Sigma_2^P$, and co-$NP \subseteq \Pi_2^P$.
Each inclusion relation is usually believed to be proper. A problem $A$ is complete
for a class $C$ if $A \in C$ and for every problem $B$ in $C$ there is a polynomial
transformation of $B$ to $A$.

4.1 Complexity of Simple Fact Update 

Now we consider the simple fact update. Given a consistent set $S$ of ground
literals and an extended logic program $\Pi$, the update of $S$ with $\Pi$ is specified by
the corresponding update specification $Update(S, \Pi)$ which is a PLP as defined
in Definition 4. From Lemma 1, we know that a set $S'$ is a result with respect
to $Update(S, \Pi)$ if and only if $S'$ is an answer set of a PLP

$\mathcal{P} = (\Pi \cup \{L \leftarrow not\ \bar{L} \mid L \in S\}, \mathcal{N}, <)$,

where for each $r: L \leftarrow not\ \bar{L}$ with $L \in S$ and each $r' \in \Pi$, $\mathcal{N}(r) < \mathcal{N}(r')$. So we
call this $\mathcal{P}$ the equivalent PLP of update specification $Update(S, \Pi)$. From this
result, it is clear that to evaluate $Update(S, \Pi)$, we only need to compute the
answer set of $\mathcal{P}$.

^5 Given $S$ and $\Pi$, if we view each literal $L$ in $S$ as a rule $L \leftarrow$, updating $S$ with $\Pi$ is
equivalent to updating $\Pi_S$ with $\Pi$ where $\Pi_S = \{L \leftarrow \mid L \in S\}$.

Proposition 1. Let $\Pi$ be a well defined program and $r$ a rule. Deciding whether
$r$ is defeated by $\Pi$ is co-NP-complete.

Now we need to provide a characterization of the answer sets of the particular
class of PLPs that are of the form of $\mathcal{P}$ above. Given a PLP $\mathcal{P} = (\Pi_1 \cup \Pi_2, \mathcal{N}, <)$,
where for each $<$-relation $\mathcal{N}(r) < \mathcal{N}(r')$ in $\mathcal{P}$, $r \in \Pi_1$ and $r' \in \Pi_2$. Let $S$ be an
answer set of $\Pi_1 \cup \Pi_2$. We specify $R_<(S) = (\Phi_1(S), \Phi_2(S))$, where $\Phi_1(S)$ is the
subset of $\Pi_1$ in which each rule is defeated by $S$, and $\Phi_2(S)$ is the subset of $\Pi_2$
in which each rule is defeated by $S$.

Definition 8. Let $\mathcal{P} = (\Pi_1 \cup \Pi_2, \mathcal{N}, <)$ be specified as above, and $S$ and $S'$ two
answer sets of $\Pi_1 \cup \Pi_2$. We say $S$ is more $<$-consistent with respect to $\mathcal{P}(<)$
than $S'$, denoted as $R_<(S') \subseteq R_<(S)$, if and only if (1) $\Phi_1(S) \subset \Phi_1(S')$ (proper
set inclusion); or (2) $\Phi_1(S) = \Phi_1(S')$ and $\Phi_2(S') \subseteq \Phi_2(S)$. $S$ is maximally
$<$-consistent with respect to $\mathcal{P}(<)$ if there does not exist another $S''$ such that
$R_<(S) \subset R_<(S'')$^6.

Intuitively, a maximal $<$-consistent answer set $S$ of $\Pi_1 \cup \Pi_2$ with respect to
$\mathcal{P}(<)$ defeats a minimal number of rules in $\Pi_1$ and a maximal number of rules
in $\Pi_2$. Since for each $\mathcal{N}(r) < \mathcal{N}(r')$ in $\mathcal{P}(<)$, $r \in \Pi_1$ and $r' \in \Pi_2$, this property
ensures that $S$ is evaluated by applying rules in $\Pi_1$ first, which is consistent with
the intuition of $<$-relations specified in $\mathcal{P}(<)$. In general, we have the following
result.

Lemma 2. Let $\mathcal{P} = (\Pi_1 \cup \Pi_2, \mathcal{N}, <)$ be a PLP, where for each $<$-relation
$\mathcal{N}(r) < \mathcal{N}(r')$ in $\mathcal{P}$, $r \in \Pi_1$ and $r' \in \Pi_2$. A set of ground literals $S$ is an
answer set of $\mathcal{P}$ if $S$ is a maximal $<$-consistent answer set of $\Pi_1 \cup \Pi_2$ with
respect to $\mathcal{P}(<)$.

Theorem 1. Let $\mathcal{P}$ be the equivalent PLP of a simple fact update specification,
and $S$ a set of ground literals. Deciding whether $S$ is an answer set of $\mathcal{P}$ is
co-NP-complete.

Proof. (Proof Sketch). Due to a space limit, we only outline our proof idea here.
According to our previous description, $\mathcal{P}$ is a PLP of the form $\mathcal{P} = (\Pi \cup \{L \leftarrow
not\ \bar{L} \mid L \in S\}, \mathcal{N}, <)$, where for each $r: L \leftarrow not\ \bar{L}$ with $L \in S$ and each
$r' \in \Pi$, $\mathcal{N}(r) < \mathcal{N}(r')$. So $\mathcal{P}$ satisfies the condition of Lemma 2. Then the
membership is easy to show from Lemma 2. To prove the hardness part, we can
reduce the well known NP-complete satisfiability problem to the complement
of our problem. In particular, for a given collection of nonempty propositional
clauses $C = \{C_1, \cdots, C_m\}$ on propositional letters $P_1, \cdots, P_n$, we construct a
PLP $\mathcal{P} = (\Pi_1 \cup \Pi_2, \mathcal{N}, <)$, where each rule in $\Pi_1$ has the form $L \leftarrow not\ \bar{L}$, and
$\mathcal{P}(<)$ is defined to be the set of relations $\mathcal{N}(r) < \mathcal{N}(r')$ for any $r \in \Pi_1$ and $r' \in \Pi_2$, such
that an answer set $S$ of $\Pi_1 \cup \Pi_2$ is not maximally $<$-consistent with respect to
$\mathcal{P}(<)$ iff $C$ is satisfiable. Since our construction can be done in polynomial time,
this proves co-NP-hardness. □

^6 $R_<(S) \subset R_<(S'')$ means $R_<(S) \subseteq R_<(S'')$ and $R_<(S'') \not\subseteq R_<(S)$.

The following result presents the complexity of inference associated with the
simple fact update.

Theorem 2. Let $\mathcal{P}$ be an arbitrary finite propositional PLP. For a given ground
literal $L$, deciding whether $\mathcal{P} \models L$ is $\Pi_2^P$-complete.

Proof. The membership proof is straightforward. Here we only give the hardness
proof. From [4], we know that disjunctive logic program entailment is $\Pi_2^P$-hard.
This result still holds even if the disjunctive logic program is head-cycle free^7.
We will construct a PLP $\mathcal{P}$ from a head-cycle free disjunctive logic program $\mathcal{D}$
such that there exists a one-to-one correspondence between stable models of $\mathcal{D}$
and answer sets of $\mathcal{P}$. Since our construction is in polynomial time, this proves
$\Pi_2^P$-hardness.

Let $\mathcal{D}$ be a propositional disjunctive normal logic program that involves
propositional letters $P_1, \cdots, P_n$. The PLP $\mathcal{P}$ we will construct from $\mathcal{D}$ involves
$P_i$ together with $\bar{P}_i$ ($i = 1, \cdots, n$). Let $\mathcal{P} = (\Pi, \mathcal{N}, <)$. Firstly, we specify $\Pi$ to
consist of the following rules:

(a) For each rule $P_1 \vee \cdots \vee P_k \leftarrow P_{k+1}, \cdots, P_l, not\ P_{l+1}, \cdots, not\ P_m$ in $\mathcal{D}$, we
specify $k$ rules in $\Pi$ as follows:

$P_1 \leftarrow P_{k+1}, \cdots, P_l, not\ P_{l+1}, \cdots, not\ P_m, not\ P_2, \cdots, not\ P_k$,

$P_2 \leftarrow P_{k+1}, \cdots, P_l, not\ P_{l+1}, \cdots, not\ P_m, not\ P_1, not\ P_3, \cdots, not\ P_k$,

$\cdots$

$P_k \leftarrow P_{k+1}, \cdots, P_l, not\ P_{l+1}, \cdots, not\ P_m, not\ P_1, \cdots, not\ P_{k-1}$;

(b) For each $P_i$ ($i = 1, \cdots, n$), if $P_i$ does not occur as a head of some rule in $\Pi$
as specified in Step (a), then rule $\bar{P}_i \leftarrow not\ P_i$ is included in $\Pi$.

Now the $<$-relation in $\mathcal{P}$ is specified as follows:

Let $r$ be a rule in $\Pi$ specified in Step (a), and $r'$ a rule in $\Pi$ specified in
Step (b), then $\mathcal{N}(r) < \mathcal{N}(r')$.
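
Step (a) of the construction is a purely syntactic rewriting and can be sketched in a few lines. The Python below is illustrative only: it assumes propositional letters are plain strings and represents a disjunctive rule by three hypothetical sequences (head letters, positive body, not-body). It covers Step (a) alone, not Step (b) or the ordering.

```python
def shift(heads, pos, neg):
    """Step (a): replace P1 v ... v Pk <- body by k rules, one per head letter,
    each extended with 'not Pj' for every other head letter Pj."""
    rules = []
    for p in heads:
        others = tuple(q for q in heads if q != p)
        rules.append((p, tuple(pos), tuple(neg) + others))
    return rules

# a v b v c <- d, not e
shifted = shift(["a", "b", "c"], ["d"], ["e"])
```

Each resulting triple is (head, positive body, not-body); for instance the first rule reads a <- d, not e, not b, not c. The rewriting is linear in the size of each rule times the number of its head letters, in line with the polynomial-time claim of the proof.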

Now we prove that there is a one-to-one correspondence between $\mathcal{D}$'s stable
models and $\mathcal{P}$'s answer sets. Firstly, we prove that if $S$ is a stable model of $\mathcal{D}$,
then $S' = S \cup \{\bar{P} \mid P \notin S\}$ is an answer set of $\mathcal{P}$. Let $\Pi = \Pi_1 \cup \Pi_2$, where $\Pi_1$
includes all rules specified in Step (a) and $\Pi_2$ includes all rules specified in Step
(b) in the above process.

As there is no propositional letter $\bar{P}$ occurring in $\Pi_1$, we can prove that $S$ is a
stable model of $\mathcal{D}$ iff $S$ is a stable model of $\Pi_1$. Let $\mathcal{D}^S$ be the Gelfond-Lifschitz
transformation of $\mathcal{D}$ with respect to $S$. Clearly, from the stable model definition
for disjunctive logic programs, $S$ is the minimal model of program $\mathcal{D}^S$, in which no
negation as failure sign not occurs in the body of each rule. Then for each rule in $\mathcal{D}^S$ of
the form $P_1 \vee \cdots \vee P_k \leftarrow P_{k+1}, \cdots, P_l$, if $P_{k+1}, \cdots, P_l$ are in $S$, then only one
of $P_1, \cdots, P_k$ will be in $S$. Now consider program $\Pi_1^S$, the Gelfond-Lifschitz
transformation of $\Pi_1$ with respect to $S$. It is observed that for each rule $P_1 \vee
\cdots \vee P_k \leftarrow P_{k+1}, \cdots, P_l$ in $\mathcal{D}^S$, if $P_i$ ($1 \le i \le k$) is in $S$, then there is a rule
$P_i \leftarrow P_{k+1}, \cdots, P_l$ in $\Pi_1^S$, where the $k - 1$ rules of the forms

$P_1 \leftarrow P_{k+1}, \cdots, P_l, not\ P_{l+1}, \cdots, not\ P_m, not\ P_2, not\ P_3, \cdots, not\ P_k$,

$\cdots$,

$P_{i-1} \leftarrow P_{k+1}, \cdots, P_l, not\ P_{l+1}, \cdots, not\ P_m, not\ P_1, \cdots, not\ P_{i-2}, not\ P_i, \cdots, not\ P_k$,

$P_{i+1} \leftarrow P_{k+1}, \cdots, P_l, not\ P_{l+1}, \cdots, not\ P_m, not\ P_1, \cdots, not\ P_i, not\ P_{i+2}, \cdots, not\ P_k$,

$\cdots$,

$P_k \leftarrow P_{k+1}, \cdots, P_l, not\ P_{l+1}, \cdots, not\ P_m, not\ P_1, \cdots, not\ P_{k-1}$,

in $\Pi_1$ are eliminated from $\Pi_1^S$. It follows that $S$ is the minimal set such
that for each rule $P_i \leftarrow P_{k+1}, \cdots, P_l$ in $\Pi_1^S$, $P_{k+1}, \cdots, P_l$ in $S$ implies $P_i$ in $S$.
Therefore, $S$ is also a stable model of $\Pi_1$. In a similar way, we can prove that
if $S$ is a stable model of $\Pi_1$, then $S$ is also a stable model of $\mathcal{D}$.

So far, we have shown that $S$ is a stable model of $\mathcal{D}$ iff $S$ is a stable model
of $\Pi_1$. Then it is easy to see that each reduct of $\mathcal{P}$ has the form $\Pi_1 \cup \Pi_2'$, where
$\Pi_2' \subseteq \Pi_2$. Clearly, for any rule $\bar{P} \leftarrow not\ P$ in $\Pi_2$ that is eliminated in $\Pi_2'$, $P$ is in
every answer set of $\Pi_1 \cup \Pi_2'$. Furthermore, as $head(\Pi_2) \cap body(\Pi_1) = \emptyset$, according
to the Generalized Splitting Theorem [11], we conclude that $S \cup \{\bar{P} \mid P \notin S\}$ is
an answer set of $\mathcal{P}$.

Similarly, we can also prove that if $S'$ is an answer set of $\mathcal{P}$, then $S = S' - \{\bar{P} \mid
\bar{P} \in S'\}$ is a stable model of $\mathcal{D}$. This shows the one-to-one correspondence between
$\mathcal{D}$ and $\mathcal{P}$^8. So the result holds. □

^7 Readers are referred to [2] for the definition of a head-cycle free disjunctive logic
program.

4.2 Complexity of Program Update 

Now we consider the complexity of program update. Given extended logic
programs $\Pi_0$ and $\Pi_1$, updating $\Pi_0$ with $\Pi_1$ consists of two steps: the first step is
to obtain a subset $\Pi_{(\Pi_0,\Pi_1)}$ of $\Pi_0$ with respect to some answer set of $\Pi_0$, which
eliminates the contradictory rules from $\Pi_0$; the second is to specify the update
specification $P\text{-}Update(\Pi_0, \Pi_1)$, which is a PLP as defined in Definition 7 and which
solves the conflict between $\Pi_{(\Pi_0,\Pi_1)}$ and $\Pi_1$. Hence, computing a resulting
program $\Pi_0'$ after updating $\Pi_0$ with $\Pi_1$ consists of two components: computing
$\Pi_{(\Pi_0,\Pi_1)}$ and computing a reduct of $P\text{-}Update(\Pi_0, \Pi_1)$. We first analyze the
computational complexity of the first step, contradiction elimination.

Lemma 3. Given an extended logic program $\Pi$ and a set of ground literals $S$.
$S$ is coherent with $\Pi$ if and only if program $\Pi \cup \{L \leftarrow \mid L \in S\}$ is well defined.

Theorem 3. Let $S$ be a consistent set of ground literals and $\Pi$ a program.
Deciding whether $S$ is coherent with $\Pi$ is NP-complete.

The following theorem states that the check for contradiction elimination in 
the first step of a program update is co-NP-complete. 

Theorem 4. Let $S$ be a consistent set of ground literals, $\Pi$ a program, and $\Pi'$
a subset of $\Pi$. Deciding whether $\Pi'$ is a maximal subset of $\Pi$ such that $S$ is
coherent with $\Pi'$ is co-NP-complete.

^8 Ben-Eliyahu and Dechter also showed the equivalence between a head-cycle free
extended disjunctive logic program and its translation to an extended logic program
under the answer set semantics [2].







The following two theorems give the complexity of model checking and
inference in program update respectively. It is interesting to note that the inference
problem in a program update turns out to be no harder than that in a simple fact
update, as the following theorems state.

Theorem 5. Let $P\text{-}Update(\Pi_0, \Pi_1)$ be a logic program update specification.
Deciding whether a set of ground literals $S$ is an answer set of some resulting
program of $P\text{-}Update(\Pi_0, \Pi_1)$ is co-NP-complete.

Theorem 6. Let $P\text{-}Update(\Pi_0, \Pi_1)$ be a logic program update specification.
Then given a ground literal $L$, deciding whether $L$ is entailed by every
resulting program of $P\text{-}Update(\Pi_0, \Pi_1)$ is $\Pi_2^P$-complete.

5 Inference with Lower Complexity 

In this section, we try to further characterize specific classes of update
specifications where the inference problem has a lower computational complexity. We
first consider the simple fact update. Let $Update(S, \Pi)$ be an update specification
and $\mathcal{P} = (\Pi \cup \{L \leftarrow not\ \bar{L} \mid L \in S\}, \mathcal{N}, <)$ be the equivalent PLP of $Update(S, \Pi)$
as we described in Section 4.1. Obviously, if $\mathcal{P}$ has a unique reduct, say $\Pi^*$,
then for any given ground literal $L$, the problem of deciding whether $\mathcal{P} \models L$ is
reduced to the problem of deciding whether $\Pi^* \models L$. As $\Pi^*$ is an extended logic
program, and we know that the inference problem for extended logic programs is
co-NP-complete [2], it follows that the complexity of the inference problem in this
specific class of update specifications is lower than in the general case. The key idea here is
to extend the notion of local stratification of normal logic programs to extended
logic programs by treating each negative literal as a new propositional atom [6,
7, 11]. Then we have the following result.

Theorem 7. Let $Update(S, \Pi)$ be a simple fact update specification, and $\mathcal{P} =
(\Pi \cup \{L \leftarrow not\ \bar{L} \mid L \in S\}, \mathcal{N}, <)$ be the equivalent PLP of $Update(S, \Pi)$. If
$S \cap body(\Pi) = \emptyset$ and $\Pi$ is locally stratified, then for a given ground literal $L$,
deciding whether $\mathcal{P} \models L$ is co-NP-complete.

The above theorem is achieved by proving a condition of unique reduct: if
$S \cap body(\Pi) = \emptyset$ and $\Pi$ is locally stratified [6], then $\mathcal{P}$ has a unique reduct
$\Pi^*$ [11]. Therefore, deciding whether $\mathcal{P} \models L$ is equivalent to deciding whether
$\Pi^* \models L$. If we know the unique answer set of $\Pi$, we then have the following
corollary.

Corollary 1. Let $Update(S, \Pi)$ be a simple fact update specification. If $S \cap
body(\Pi) = \emptyset$ and $\Pi$ has a unique answer set $S^*$, then the resulting knowledge
base $S'$ with respect to $Update(S, \Pi)$ can be computed in $O(|m| \cdot |n|)$ time, where
$|m|$ and $|n|$ are cardinalities of sets $S$ and $S^*$ respectively.

Similarly to the case of simple fact update, we can also characterize a particular
class of program update specifications where the inference problem associated
with these update problems is reduced to co-NP-complete.

Theorem 8. Let $P\text{-}Update(\Pi_0, \Pi_1) = (\Pi_{(\Pi_0,\Pi_1)} \cup \Pi_1, \mathcal{N}, <)$ be a program
update specification. If $head(\Pi_1) \cap body(\Pi_0) = \emptyset$ and $\Pi_0$ is locally stratified,
then for a given ground literal $L$, deciding whether $P\text{-}Update(\Pi_0, \Pi_1) \models L$ is
co-NP-complete.

6 Conclusions 

In this paper, we investigated the semantics of logic program updates under the 
framework of prioritized logic programs, where priority is introduced to solve 
conflicts in updates. It turns out that the complexity of logic program updates
remains at the same level of the polynomial hierarchy as principal update tasks
[8]. It is also interesting to note that although representing a knowledge base as
a set of rules adds expressive power in the domain, it actually does not introduce 
an increase in the complexity. This is because in our framework, a logic program 
update is specified by two separate steps: simple fact update and program up- 
date, and each of these two forms of updates has the same complexity. When a 
knowledge base is represented as a set of facts, the second step - program update, 
is identical to the first step and hence becomes unnecessary. 

Finally, since the focus of this paper is on the complexity of logic program
updates, we did not compare our approach with other relevant methods due to
a space limit. Such a comparison can be found in [11].

Abstract. In many multimedia application systems, it is not the final
goal to retrieve the relevant multimedia information from different
multimedia information sources. Rather, post-processing of the retrieved
multimedia information is needed. For example, the retrieved information
is used as “known facts”. The systems will do some reasoning to obtain
further conclusions based on these multimedia-form “known facts”. We
call this reasoning with multimedia information. Most current research
work in multimedia information processing is focused on multimedia
information retrieval, but post-processing of the retrieved information is
more or less ignored. This paper explores a way to tackle this problem
by using symbolic projection. A case study of reasoning with still
image information is presented. Some extensions to symbolic projection
(introducing auxiliary pictorial objects in symbolic pictures that need to
be processed) are discussed. We expect this paper will stimulate further
research on this important but ignored topic.



Keywords: Automated Reasoning, Geometric Reasoning, Spatial Reasoning,
Multimedia, Symbolic Projection



1 Introduction 

The multimedia revolution started at the beginning of the 1990s [11]. Multimedia is
virtually revolutionary in many areas such as communications, computing,
entertainment, consumer electronics etc. Although many challenging
problems remain to be researched and resolved for the further growth of
multimedia, with the advent of relatively cheap, large online storage capacities and
advances in digital compression, comprehensive sources of text, image, video,
and audio multimedia data can be stored and made available for research
and applications. Actually, the technical ability to generate volumes of digital
multimedia data is becoming increasingly “mainstream” in today’s electronic
world.

With numerous digital libraries, other large multimedia databases, and
multimedia based WWW pages available, sophisticated search or retrieval techniques
are required to find relevant information in these large digital data repositories.
These retrieval techniques should provide not only fewer bits but also the right
bits to users.

We notice that retrieving the right bits from multimedia data repositories is
not the final step in many multimedia applications. Further processing of the
retrieved multimedia information is needed. In this paper, we call such multimedia
information processing post-processing. For example, in our ongoing agent-based
financial investment adviser project [1], we need to retrieve stock market
information. These data include breaking news in audio or text form that affects
the stock market, closing prices of specific securities (table), security price or moving
average charts (see Figure 1), company revenue (pie chart), company profiles etc.
In our financial investment advising multi-agent system, different agents are
delegated to gather information in different forms. The decision making agents will
make financial investment decisions based on their domain knowledge and the
retrieved multimedia information. To this end, we come across two problems:
how to fuse information in different media forms and how to reason with this
multimedia information to obtain new results. These two “how to”s are the most
important topics in multimedia information post-processing.

With these observations in mind, we proposed that multimedia information
processing be divided into three levels: multimedia information storage, retrieval,
and post-processing [2]. Currently, most work on multimedia information
processing is focused on multimedia information storage and retrieval, especially
indexing and content-based access of multimedia information. Post-processing
of this information is nearly ignored.

In [2], we identified the concepts of multimedia information post-processing. 
Two of the important topics in post-processing are fusion of multimedia 
information and reasoning with multimedia information. In this paper, we 
address some issues in reasoning with multimedia information. The discussion 
is based on a subtask in our financial investment adviser project. 

To give advice about stock buying/selling we need to do some reasoning based 
on the moving average chart of a specific security as well as many other analyses. 
There are two main trading rules for moving averages: (1) A buy signal is given 
when the price moves up and crosses over the moving average from below; (2) 
A sell signal is given when the price moves down and crosses over the moving 
average from above. Here, it is more natural and convenient to represent the 
condition parts of these rules as well as the “known facts” (retrieved moving 
average charts of some specific securities) by graphics (charts). This is one form 
of reasoning with multimedia information. For this specific problem, it is actually 
a problem of reasoning with still images. We employ symbolic projection theory 
[3] to do such reasoning. Symbolic projection is a theory of spatial relations. This 
theory is the basis of a conceptual framework for image representation, image 
structuring and spatial reasoning. 

The essence of our problem is the matching of two moving average 
images (charts). Traditional approaches to such problems rely on the 
maximum-likelihood or minimum distance criterion. The symbolic 
description of visual information such as shape or spatial relations is a very 
difficult task using the traditional approaches. Attempts to describe this 
information textually can lead to representations that are either too general 
(refer to the two trading rules) or too complex. The symbolic projection 
approach is more flexible and efficient [4] [5] [10]. That is why we turn to 
symbolic projection theory. 

When using symbolic projection to solve our moving average problem, we 
found that directly using symbolic projection is not sufficient. Thus further 
extensions to symbolic projection are explored. A new concept, introducing a 
benchmark object into symbolic pictures before applying symbolic projections, 
is developed. 

In short, our work in this paper extends the application scope of symbolic 
projection as well as the theory itself. Furthermore, the problem of reasoning 
with multimedia information is highlighted. 

The rest of the paper is organized as follows. Section 2 presents the problem 
we want to solve. Using symbolic projection to solve the problem is discussed in 
Section 3. Section 4 offers some discussion. Finally, Section 5 gives concluding 
remarks. 

2 Problem Descriptions 

When giving advice on investment in the stock market, our financial 
investment adviser system uses fundamental analysis, technical analysis 
of securities, and other domain knowledge. Fundamental analysis endeavors to 
determine the fair value of a share. The emphasis of technical analysis is on 
information generated by the market itself. Technical analysis is an attempt to 
forecast future prices by studying past prices. Traditionally, this has been done 
using various types of charts that provide a visual record of past prices. There are 
four conventional types of charts: the bar chart, the candlestick chart, the point 
and figure chart, and moving averages. In this paper, we take the processing of 
moving average charts as an example. 

A moving average of past prices can be used as an indicator of a price trend. 
There are two main trading rules for moving averages: 

— A buy signal is given when the price moves up and crosses over the moving 
average from below. 

— A sell signal is given when the price moves down and crosses over the moving 
average from above. 
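
As an illustration only, the two trading rules can be sketched numerically in Python. The trailing simple moving average, the window length, and the function names below are assumptions made for this sketch; the paper itself reasons over chart images rather than raw price series.

```python
from typing import List, Optional


def simple_moving_average(prices: List[float], window: int) -> List[Optional[float]]:
    """Trailing moving average; None until a full window is available."""
    out: List[Optional[float]] = []
    for i in range(len(prices)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(prices[i + 1 - window:i + 1]) / window)
    return out


def crossover_signals(prices: List[float], window: int) -> List[str]:
    """Label each day 'buy', 'sell', or 'hold' using the two trading rules."""
    ma = simple_moving_average(prices, window)
    signals = ["hold"] * len(prices)
    for i in range(1, len(prices)):
        if ma[i - 1] is None:
            continue  # not enough history to compare yesterday with today
        prev_diff = prices[i - 1] - ma[i - 1]
        curr_diff = prices[i] - ma[i]
        if prev_diff <= 0 < curr_diff:
            signals[i] = "buy"   # price crossed the average from below
        elif prev_diff >= 0 > curr_diff:
            signals[i] = "sell"  # price crossed the average from above
    return signals


print(crossover_signals([10, 9, 8, 9, 11, 12, 11, 8, 7], window=3))
```

With this hypothetical price series, a buy is flagged on the fourth day and a sell on the seventh; every other day is a hold.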

Based on the trading rules, there are three rules related to the moving 
average chart in the knowledge base of the decision making agents: 

— Rule 1: If the moving average chart of a security is similar to Figure 2 (a), 
then buy this security; 

— Rule 2: If the moving average chart of a security is similar to Figure 2 (b), 
then sell this security; 

— Rule 3: If the moving average chart of a security is similar to Figure 2 (c), 
then neither buy nor sell this security. 

Fig. 1. Example Moving Average Chart 

Fig. 2. Moving Averages of Security Price 



Here, it is natural to represent the condition parts of these rules directly by 
graphics (charts). They are very difficult to express precisely in text, whereas 
the graphical form is easy to work with. Now, if we search web sites on the 
Internet and retrieve the moving average chart of a specific security (Figure 3), 
how can we infer the conclusion using these rules? This is one form of reasoning 
with multimedia information. For this specific problem, it is actually a problem 
of reasoning with still images. 

We employ symbolic projection theory to do such reasoning. We represent 
the moving average chart as 2D strings in symbolic projection, and then use 
a 2D string matching algorithm to accomplish the reasoning. The details are 
discussed in the next section. 

Fig. 3. Retrieved Moving Average of a Specific Security 



3 Solving the Problem by Symbolic Projection 

The theory of symbolic projection was first developed by Chang and co-workers 
[4]. It forms the basis of a wide range of image information retrieval algorithms. 
It also supports pictorial-query-by-picture, so that the user of an image infor- 
mation system can simply draw a picture and use the picture as a query. Many 
researchers have since extended this original concept, so that there is now a rich 
body of theory as well as empirical results [5] [6]. The extended theory of 
symbolic projection can deal not only with point-like objects, but also with 
objects of any shape and size. Moreover, the theory can deal with not only one 
symbolic picture, but also multiple symbolic pictures, three-dimensional (3D) 
pictures, a time sequence of pictures, etc. 

A symbolic picture is a two-dimensional matrix of symbols. Each object of 
the real image is represented by a symbol located at the centroid of the object. A 
symbolic picture has at least two symbolic projections: the x-projection and the 
y-projection. The x- or y-projection of a symbolic picture can be constructed 
by projecting the names of objects in each column (respectively, row) of the 
symbolic picture onto the x- or y-axis. A pair of two symbolic projections is 
called a 2D string. 

3.1 A Brief Introduction of Symbolic Projection 

Let $\Sigma$ be a set of symbols, or the vocabulary. Each symbol might represent a 
pictorial object, a pixel, etc. 

Let $A$ be the set $\{=, <, |\}$, where “=”, “<”, and “|” are three special symbols 
not in $\Sigma$. These symbols will be used to specify spatial relationships between 
pictorial objects. 

A 1D string over $\Sigma$ is any string $x_1 x_2 \ldots x_n$, $n > 0$, where the $x_i$’s are in $\Sigma$. 

A 2D string over $\Sigma$, written as $(u, v)$, is defined to be 

$$(u, v) = (x_1 y_1 x_2 y_2 \cdots y_{n-1} x_n,\; x_{p(1)} z_1 x_{p(2)} z_2 \cdots z_{n-1} x_{p(n)})$$

where $x_1 \ldots x_n$ is a 1D string over $\Sigma$, $p : \{1, \ldots, n\} \to \{1, \ldots, n\}$ is a 
permutation over $\{1, \ldots, n\}$, $y_1 \ldots y_{n-1}$ is a 1D string over $A$, and 
$z_1 \ldots z_{n-1}$ is a 1D string over $A$. 

In the above, the symbol “<” denotes the left-right spatial relation in string 
$u$, and the below-above spatial relation in string $v$. The symbol “=” denotes the 
spatial relation “at the same spatial location as”. The symbol “|” denotes the 
“edge-to-edge” spatial relation (two objects are in direct contact either in the 
left-right or in the below-above direction). Therefore, the 2D string 
representation can be seen as the symbolic projection of picture $f$ along the 
x- and y-axes. 

A symbolic picture $f$ is a mapping $M \times M \to W$, where $M = \{1, 2, \ldots, m\}$ 
and $W$ is the power set of $\Sigma$ (the set of all subsets of $\Sigma$). The empty set $\{\}$ 
then denotes a null object. 

Given $f$, we can construct the corresponding 2D string representation $(u, v)$, 
and vice versa, such that all left-right and below-above spatial relations among 
the pictorial objects in $\Sigma$ are preserved. In [3] (pp. 32-33), a formal algorithm 
(called 2Dstring) for constructing the 2D string $(u, v)$ from $f$ is presented. 
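
The construction can be illustrated with a minimal Python sketch. This is not the 2Dstring procedure of [3]: it assumes that each object is already reduced to a hypothetical centroid grid position, and it uses only the “<” and “=” operators (no edge-to-edge operator and no cutting).

```python
from typing import Dict, List, Tuple


def project(coords: Dict[str, int]) -> str:
    """Build one projection: '=' joins symbols at the same position,
    '<' separates successive positions along the axis."""
    groups: Dict[int, List[str]] = {}
    for name, pos in coords.items():
        groups.setdefault(pos, []).append(name)
    parts = [" = ".join(sorted(groups[pos])) for pos in sorted(groups)]
    return " < ".join(parts)


def two_d_string(picture: Dict[str, Tuple[int, int]]) -> Tuple[str, str]:
    """Return (u, v): the projections onto the x-axis and the y-axis."""
    u = project({name: xy[0] for name, xy in picture.items()})
    v = project({name: xy[1] for name, xy in picture.items()})
    return u, v


# Hypothetical symbolic picture: object name -> (x, y) centroid position.
picture = {"a": (1, 1), "b": (2, 3), "c": (3, 1)}
print(two_d_string(picture))  # ('a < b < c', 'a = c < b')
```

Symbols that share a position are listed alphabetically here; the order within an “=” group carries no spatial meaning.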

The 2D string representation also provides a simple approach to performing 
subpicture matching on 2D strings. An algorithm for 2D string matching (2Dmatch) 
is given in [3] (pp. 38-40). 
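
To give a feel for subpicture matching, the following Python sketch checks whether a query 2D string is order-preserving with respect to a target 2D string. It is a deliberately simplified stand-in, not the 2Dmatch algorithm of [3]: it assumes symbols are alphanumeric names, treats “|” like “<” when ranking, and compares only the pairwise order of the query symbols in each projection.

```python
import re
from itertools import combinations
from typing import Dict, Tuple


def ranks(projection: str) -> Dict[str, int]:
    """Map each symbol to its rank: '=' keeps the rank, '<' and '|' advance it."""
    tokens = re.findall(r"[A-Za-z]\w*|[<=|]", projection)
    rank = 0
    out: Dict[str, int] = {}
    for tok in tokens:
        if tok in ("<", "|"):
            rank += 1
        elif tok != "=":
            out[tok] = rank
    return out


def sign(n: int) -> int:
    return (n > 0) - (n < 0)


def order_preserving_match(query: Tuple[str, str], target: Tuple[str, str]) -> bool:
    """True if every pair of query symbols has the same relative order
    (before, same, after) in both projections of the target."""
    for q_proj, t_proj in zip(query, target):
        q, t = ranks(q_proj), ranks(t_proj)
        if not set(q) <= set(t):
            return False  # the query mentions a symbol the target lacks
        for x, y in combinations(q, 2):
            if sign(q[x] - q[y]) != sign(t[x] - t[y]):
                return False
    return True


target = ("a|b|c", "c = a < b")
print(order_preserving_match(("a|c", "a = c"), target))  # True
print(order_preserving_match(("b|a", "a < b"), target))  # False
```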

With the edge-to-edge operator “|” we can further segment an object into 
its constituent parts. This is accomplished by introducing cutting lines. A 
systematic way of drawing the cutting lines is as follows: first, the extremal 
points are found in both the horizontal and vertical directions. Next, vertical 
and horizontal cutting lines are drawn through these extremal points. This 
technique gives a natural segmentation of planar objects into their constituent 
parts. 

With the cutting mechanism, we can also formulate a general representation, 
encompassing the other representations based upon different operator sets. This 
consideration leads to the formulation of a generalized 2D string system [7]. 

A generalized 2D string system is a five-tuple ($\Sigma$, $C$, $A$, $e$, “(, )”), where $\Sigma$ is 
the vocabulary; $C$ is the cutting mechanism, which consists of cutting lines at 
the extremal points of objects; $A = \{<, =, |\}$ is the set of spatial operators; $e$ is 
a special symbol which can represent an area of any size and any shape, called 
the empty-space object; and “(, )” is a pair of operators which is used to describe 
local structure. 

The cutting mechanism defines how the objects in an image are to be seg- 
mented, and also makes it possible for the local operator “(,)” to be used as a 
global operator to be inserted into original 2D strings. 

The spatial operator set A can be extended to contain other spatial relation 
operators used in different applications. For extension examples, refer to [5] [6]. 

Fig. 4. (a) Moving Average Chart Segmentation Using Cutting Lines; (b) Moving 
Average With Benchmark Object; (c) Final Moving Average Segmentation With 
Benchmark Object 



3.2 2D String Representation of Moving Average Charts 

Having introduced the basic concepts of symbolic projection, we are now ready 
to discuss the 2D string representation of moving average charts. 

Take Figure 2 (b) as an example. The following cutting mechanism $C$ is 
applied: choose the upper horizontal extremal point, $p$, and draw two vertical 
cutting lines $x = p + \delta$ and $x = p - \delta$. The rest remains the same as described in 
Section 3.1. The segmented moving average chart is shown in Figure 4 (a). 

The vocabulary is $\Sigma = \{a, b, c\}$. The 2D string representing the picture in 
Figure 4 (a) is as follows: 

$u : a|b|c, \quad v : c = a < b.$ 

Therefore, Rule 2 in Section 2 can be expressed as: 

Rule 2: If $u : a|b|c$ and $v : c = a < b$, then sell the security. 

Similarly, we can obtain the 2D string representations of other moving av- 
erage charts, and rewrite the corresponding rules. We can then employ the 2D 
string matching algorithm or other string matching algorithms to do the reason- 
ing. 



3.3 Special Considerations for Our Problem 

The above representation does not fully describe the meaning of the 
antecedents of the trading rules. For example, in Figure 5, $g_1$, $g_2$, and $g_3$ have 
different slopes, but their 2D representations are the same. This causes a 
problem: no matter whether the price trend is indicated by $g_1$, $g_2$, or $g_3$, a sell 
signal will be given. In practice, when the price drops only slightly (as 
indicated by $g_3$), no sell signal is given. 

How can we deal with this problem? It occurs for two reasons. One is that 
the descriptions of the trading rules themselves are fuzzy. The other 
is the insufficient expressiveness of the 2D strings. For such a problem, simply 
extending the spatial operator set $A$ cannot help. One may argue that one 
generalized symbolic projection, slope projection [3] (Chapter 9), can be used to 
solve this problem. But since no object exists to compare the price trend with, 
we still cannot use it directly. What we need is a benchmark picture 
object that indicates whether the sell or buy signal should be given or not. This 
is our concept: introduce auxiliary object(s) into the symbolic pictures that need 
to be processed. In our example, we call such an auxiliary picture object a 
benchmark picture object. The benchmark picture object can be determined by 
consulting stock technical analysis experts. 

We again take Figure 2 (b) as an example. After introducing the benchmark 
object, the resulting moving average chart is shown in Figure 4 (b) (the same 
cutting mechanism used in the last subsection is applied, but with one more 
vertical cutting line through the upper extremal point). 

In Figure 4 (b), if the slope of the price trend is greater than that of the 
benchmark object $s$, no sell signal is given. If it is less than or equal to that of $s$, a sell signal 
is generated. The 2D string representation corresponding to Figure 4 (b) is as 
follows: 

$u : a|b_1|b_2 = s_1|c = s_2, \quad v : c = a < s_2 = b_2 = b_1 < s_1.$ 

This 2D representation can be further simplified, as $s_1$ is useless for the 
reasoning; only $s_2$ helps. In our application, no reconstruction from the 2D 
strings is needed. Thus $s_1$ can be omitted from the 2D representation. Applying 
exactly the same cutting mechanism as that used in the last subsection (see Figure 
4 (c)), we obtain the final 2D representation for our problem: 

$u : a|b|c = s, \quad v : c = a < s = b.$ 

For Rule 2, if the 2D string of the retrieved price trend is $u : a|b|c = s$, 
$v : c = a < s = b$, then a sell signal is given. Otherwise, if the 2D string of the 
retrieved price trend is $u : a|b|s = c$, $v : s = a < c = b$, then no sell signal is given. 
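
One way to picture this rule lookup is as a table keyed by normalized 2D strings. The Python sketch below is only an illustration under stated assumptions: the `normalize` helper, the `RULES` table, and the `advise` function are hypothetical names, and the normalization ignores the difference between “<” and “|” as well as the order of symbols joined by “=”.

```python
import re
from typing import Dict, FrozenSet, Tuple

Projection = Tuple[FrozenSet[str], ...]


def normalize(projection: str) -> Projection:
    """Split on '<' or '|' into ordered groups; symbols joined by '=' share a group."""
    groups = re.split(r"[<|]", projection)
    return tuple(frozenset(s.strip() for s in g.split("=")) for g in groups)


# The two 2D strings given in the text for Rule 2 with a benchmark object s.
RULES: Dict[Tuple[Projection, Projection], str] = {
    (normalize("a|b|c = s"), normalize("c = a < s = b")): "sell",
    (normalize("a|b|s = c"), normalize("s = a < c = b")): "no sell signal",
}


def advise(u: str, v: str) -> str:
    """Look up the conclusion for a retrieved chart given as a 2D string (u, v)."""
    return RULES.get((normalize(u), normalize(v)), "no rule matched")


print(advise("a|b|s = c", "a = c < b = s"))  # sell
print(advise("a|b|c = s", "a = s < b = c"))  # no sell signal
```

Exact lookup is the simplest possible matching strategy; a subpicture matching algorithm would be needed when the retrieved chart contains additional objects.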

With the introduction of the benchmark object into the moving average charts, 
we can exactly represent the meaning of the trading rules with 2D strings. One 
more problem remains to be solved: how can we add the benchmark object 
to a symbolic picture, and automatically construct the 2D strings from the 
corresponding picture? 

In our application, the benchmark object is a line segment starting from the 
extremal point, $p$. Assume the slope of the benchmark object (line segment) 
determined by stock technical analysis experts is $k$. Adding the benchmark 
object simply means drawing a line segment starting from $p$ with slope $k$. To 
construct 2D strings from the symbolic picture with the added benchmark object, 
we can still use the 2Dstring algorithm. The extended 2Dstring algorithm that can 
handle the benchmark object problem, 2DstringE, is outlined below: 

Algorithm 2DstringE(picture, k, u, v) 

/* This algorithm takes the symbolic picture picture and the slope of the */ 
/* benchmark object k as inputs. k is determined by stock technical analysis experts. */ 
/* Outputs are the 2D string representations (u, v) of picture. */ 

begin 
    /* object recognition */ 
    recognize objects in picture; 
    find the upper or lower extremal point p in picture; 

    /* add benchmark object */ 
    draw a line segment starting from p with slope equal to k; 

    /* segmentation */ 
    apply cutting mechanism C to picture; 
    find the centroid of each object; 

    find the 2D string representation using Procedure 2Dstring 
end 
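
The decision that the benchmark object encodes can also be sketched directly on a price series. The Python fragment below is not 2DstringE: it skips object recognition, cutting, and 2D string construction, and simply compares the slope of the price trend after the upper extremal point with the benchmark slope k. The function name, the window parameter `delta`, and the sample prices are assumptions for illustration.

```python
from typing import List


def sell_signal(prices: List[float], k: float, delta: int) -> bool:
    """Return True when the price trend after the peak is at least as steep
    (slope less than or equal to k) as the benchmark line of slope k."""
    p = max(range(len(prices)), key=lambda i: prices[i])  # upper extremal point
    end = min(p + delta, len(prices) - 1)
    if end == p:
        return False  # no data after the peak, so there is no trend to compare
    trend_slope = (prices[end] - prices[p]) / (end - p)
    return trend_slope <= k


# Hypothetical benchmark slope of -0.5 price units per time step.
print(sell_signal([10, 11, 12, 11, 10, 9], k=-0.5, delta=3))           # True
print(sell_signal([10, 11, 12, 11.9, 11.8, 11.7], k=-0.5, delta=3))    # False
```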

4 Discussions 

The above discussion focused on the processing of moving average charts using 
symbolic projection. In our financial investment application and many other 
applications, another commonly used chart form for representing retrieved 
information is the pie chart. For example, when the system gives investment 
advice to the client, the system may analyze the revenues of the relevant 
companies or the trading volumes of some promising securities. These are usually 
represented by pie charts (refer to Figures 6 and 7). 

It is more difficult to reason with this kind of pie chart. If we know the 
boundary lines (see Figure 8) that separate the different component parts of the 
pie chart, the task is relatively easy. Otherwise, we must first determine the 
boundary lines based on the grey levels of the different component parts of the 
pie chart. Once we have the boundary lines, we can convert the real 
pie chart to a symbolic picture. We can then use symbolic projection theory 
to process this problem. When using symbolic projection to process pie charts, 
polar projection and concentric cutting mechanisms must be used [9]. We will 
discuss the details in another paper. 

We successfully solved the problem of reasoning with curve information (still 
image) by using symbolic projection. The work presented in this paper indicates 
that symbolic projection is a very promising methodology for reasoning with 
still image information. Of course, reasoning with other still image 
information (other than curve information) is subject to further research. 

When solving the “reasoning with moving average charts” problem in our 
application, we introduced the concept of “benchmark objects”. By introducing 
benchmark objects into the moving average charts, we solved the problem 
effectively. This approach can be generalized and applied to many other problems. 
For example, one kind of difficult problem, called dynamic route planning, can be 
solved by introducing extra “reference objects” into the symbolic pictures when 
using symbolic projection theory. This problem can be generally described as 
follows: a moving object (e.g., a robot) needs to determine its direction of 
movement toward an unfamiliar place. In the moving object’s memory, there are 
some reference objects. Once it observes enough reference objects in the 
environment, it can decide in which direction to move. For this kind of problem, 
we can represent the environment as a symbolic picture. When the moving object 
sees an object that is identical to one of the reference objects in its memory, 
we add this object to the symbolic picture as an auxiliary object. After 
introducing enough auxiliary objects, we can describe the relationships among all 
the objects in the symbolic picture and, furthermore, decide the direction in 
which the moving object should move. This problem is similar to the problem of 
determining the views of a moving object addressed by E. Jungert [12], but much 
more difficult. Once again, thoroughly solving this problem needs further research. 

As mentioned in the introduction section, the problem described in this paper 
can also be solved using traditional approaches based on maximum-likelihood or 
minimum distance criterion, or pattern recognition approaches. Those solutions 
are more time-consuming and complicated. 

The theory of symbolic projection was originally developed as a technique 
for iconic indexing of image databases, and this is still an important application 
area. Symbolic projection theory also has other characteristics that make 
it suitable for various forms of spatial reasoning, in particular for qualitative 
spatial reasoning. We applied this theory to reasoning with still image 
information. These two concepts are different, although related. Spatial 
reasoning is the process of reasoning and making inferences about problems 
dealing with objects occupying space [8]. The emphasis is on the spatial 
relationships of objects. The focus of reasoning with multimedia information in 
general, and still image information in particular, is on the meaning or 
knowledge that lies behind the images. 



5 Concluding Remarks 

We identified that multimedia information processing consists of three levels: 
multimedia information storage, retrieval, and post-processing. In 
post-processing, reasoning with retrieved multimedia information to reach new 
conclusions is a paramount but relatively ignored topic. 

Reasoning with multimedia information includes reasoning with text 
information, reasoning with image information, reasoning with video information, 
and reasoning with audio information. There are many techniques for reasoning 
with text information; this is the task of traditional inference algorithms. 
There are few techniques that deal with the rest. 

This paper proposed using symbolic projection theory for reasoning with 
still image information. A case study of reasoning with moving average charts 
in finance was provided. 

When using symbolic projection to solve our problem, a new concept, the 
introduction of benchmark objects, was developed. This idea can be applied to 
other difficult problems such as dynamic route planning. 

There is much work left to do on reasoning with multimedia information. This 
paper is a small step in this direction. 

Abstract. The aim of this position paper is to outline a unified view of plausible 
reasoning under incomplete information and belief revision, based on an ordinal 
representation of uncertainty. The information possessed by an agent is supposed 
to be made of three items: sure observations, generic knowledge and inferred 
contingent beliefs. The main notion supporting this approach is the confidence 
relation, a partial ordering of events which encodes the generic knowledge of an 
agent. Plausible inference is achieved by conditioning. The paper advocates the 
similarity between plausible reasoning with confidence relations and probabilistic 
reasoning. The main difference is that the ordinal approach supports the notion of 
accepted beliefs forming a deductively closed set, while probability theory is not 
tailored for it. The framework of confidence relations sheds light on the 
connections between some approaches to non-monotonic reasoning methods, 
possibilistic logic and the theory of belief revision. In particular the distinction 
between revising contingent beliefs in the light of observations and revising the 
confidence relation is laid bare. 



1 Introduction 

The aim of this position paper is to present a synthetic view of an approach to plausible 
reasoning under incomplete information with an ordinal representation of uncertainty. 
This approach has close connections with various works carried out more or less 
independently by philosophers like Ernest Adams and David Lewis in the seventies, and 
Peter Gärdenfors and colleagues in the eighties, as well as several AI researchers, such 
as Yoav Shoham, Daniel Lehmann, Judea Pearl, Joe Halpern, Maryanne Williams, and 
others. Indeed, the issue of an ordinal approach to uncertainty has to do with several 
important topics of theoretical Artificial Intelligence, such as non-monotonic reasoning, 
belief revision, conditional logics, and probabilistic reasoning as well. It seems that 
the ultimate aim of symbolic AI in plausible reasoning is to perform a counterpart to 
probabilistic inference without probabilities (Dubois and Prade 1994a). 

Consider an agent that has to reason about the current state of the world. In order to 
support the above thesis, we claim that the issue of plausible reasoning cannot be 
properly addressed without assuming that the body of information possessed by the 
agent contains (at least) three distinct types of items: observations pertaining to the 
current situation, generic knowledge about similar situations, and beliefs as to the 
features of the current situation. Observations are supposed to be reliable and 
non-conflicting. Beliefs are, on the contrary, taken for granted, hence brittle. Under this 
assumption, plausible reasoning precisely consists in inferring beliefs from (contingent) 
observations and generic (background) knowledge, valid across situations. This view is 
classical in probability theory (De Finetti, 1974), and we claim that it also makes sense 
under a qualitative or an ordinal approach to plausible reasoning. 

2 From Confidence Relations to Accepted Beliefs 

The ordinal approach presupposes that the agent’s knowledge is modeled by a relation 
among events or propositions (built from a language), which we call a confidence relation. 
Typically, it is a partial preordering on a set of propositions, that is consistent with 
classical deduction, expressing that some propositions are generally more plausible than 
(or at least as plausible as) others. The contingent observations available to the agent 
form a context according to which the confidence relation is conditioned. Confidence 
relations include comparative probability relations first introduced by De Finetti (1937) 
and Koopman (1940), and extensively studied by Savage (1954). All set-functions used 
in uncertainty modeling (probability measures, possibility measures, belief functions, 
etc.) generate confidence relations which are complete preorders. The set-functions 
studied by Friedman and Halpern (1996) under the name of "plausibility measures" 
generate confidence relations which are partial preorders. 

A proposition is called an accepted belief for the agent if it is more plausible (in the 
sense of the confidence relation) than its negation, in the context of available 
observations. The term "accepted belief" also means that the agent considers the derived 
conclusions as valid, until some further observation is obtained that questions them; 
lastly the agent is allowed to reason with accepted beliefs as if they were true, using 
classical logic. Hence, by assumption, accepted beliefs form a deductively closed set of 
propositions. Under this assumption, a confidence relation is called an acceptance 
relation (Dubois and Prade, 1995b; Dubois et al. 1998a). 

This logical closure condition for accepted beliefs has a drastic impact on the nature 
of acceptance relations. Basically it implies that, when a proposition A is more plausible 
than each of two other ones B and C, where A, B, C are mutually exclusive, considering 
the disjunction of B and C cannot form a proposition that is more plausible than the 
most plausible one A. In other words, a notion of negligibility is embedded in the 
acceptance relation. The closure condition, plus a few other uncontroversial ones (like 
monotony with respect to set inclusion) are enough to ensure the existence of a 
representation of the acceptance relation by means of a family of so called "comparative 
possibility relations" (Dubois et al. 2001). A proposition is then more plausible than 
another one if and only if the former is more possible than the latter in the sense of all 
comparative possibility relations in the family. 

Comparative possibility relations have been independently introduced by David 
Lewis (1973) in the seventies, in the framework of modal logics of counterfactuals, and 
Dubois (1986) in the scope of decision theory. Their numerical counterparts have been 
introduced by the economist Shackle (1961), the philosopher Cohen (1977), and the 
systems engineer Zadeh (1978) completely independently of one another. Comparative 
possibility relations are very simple confidence relations because each of them is 
completely specified by means of a single complete preordering of elementary events 
(interpretations, states of nature, possible worlds) distinguishing between normal and 
less normal worlds. Namely a proposition is more possible than another if the most 
normal situation where the former is true is more plausible than all situations where the 
latter is true. The case when the ordering of elementary events is partial is studied by 
Halpern (1997). 
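
A small Python sketch may help fix these definitions. It assumes a hypothetical two-variable language (bird, flies) and a hypothetical ranking of worlds in which rank 0 is the most normal; neither comes from the paper. A proposition is more possible than another when its most normal world is strictly more normal than every world of the other, and a belief is accepted in a context when it is more possible than its negation within that context.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

World = Tuple[bool, bool]  # (bird, flies): hypothetical two-variable language
Prop = Callable[[World], bool]

WORLDS: List[World] = list(product([True, False], repeat=2))

# Hypothetical complete preordering of worlds: 0 = most normal.
# Birds that do not fly are taken to be less normal than everything else.
RANK: Dict[World, int] = {
    (True, True): 0,
    (False, False): 0,
    (False, True): 0,
    (True, False): 1,
}


def best_rank(prop: Prop) -> float:
    """Rank of the most normal world satisfying prop (infinity if none does)."""
    found = [RANK[w] for w in WORLDS if prop(w)]
    return min(found) if found else float("inf")


def more_possible(a: Prop, b: Prop) -> bool:
    """a > b iff the most normal a-world is strictly more normal than every b-world."""
    return best_rank(a) < best_rank(b)


def accepted(belief: Prop, context: Prop) -> bool:
    """belief is accepted in context iff it is more possible than its negation there."""
    return more_possible(
        lambda w: context(w) and belief(w),
        lambda w: context(w) and not belief(w),
    )


def bird(w: World) -> bool:
    return w[0]


def flies(w: World) -> bool:
    return w[1]


print(accepted(flies, bird))            # True: in the context "bird", "flies" is accepted
print(accepted(flies, lambda w: True))  # False: with no observation, nothing is concluded
```

Because only the most normal worlds matter, a disjunction of two less plausible propositions can never overtake a more plausible one, which is the negligibility effect described above.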

3 Nonmonotonic Reasoning and Default Rules 

In practice, the generic knowledge possessed by an agent is often expressed by means of 
"if then" rules. The condition part of a rule denotes a context (in the above sense: 
everything the agent has observed in a given situation) and the conclusion is an accepted 
belief of the agent in this context. Each such rule can thus be modeled as the statement 
that some proposition is more plausible than another one, and a rule base can be equated 
to a (partially defined) acceptance relation (or a plausibility measure after Friedman and 
Halpern, 1996). 

Each rule can also be modeled in the framework of a three-valued logic (Dubois and 
Prade, 1994b). In a given situation, a rule is true or false according to whether its 
conclusion is true or false, provided that its condition holds in this situation. Otherwise 
the rule takes the third truth-value, which stands for "irrelevant". This is a so-called 
tri-event, introduced by De Finetti (1937). Such generic rules form conditional knowledge 
bases, and the plausible inference of some proposition consists of syntactically deriving 
from a rule base a rule whose condition part exactly models the set of available 
observations, and whose conclusion part is the proposition under concern (Kraus et al, 
1991). In the formal framework of acceptance relations, this syntactic inference 
procedure yields a plausible proposition if and only if this proposition is an accepted 
belief in the prescribed context, in the sense of the acceptance relation (plausibility 
measure after Friedman and Halpern, 1996) induced by the rule base. 

Indeed, under the logical closure condition, the plausible inference relation 
producing accepted beliefs according to a confidence relation also satisfies all 
postulates of preferential inference introduced by Kraus et al. (1991) for the purpose of 
computing what is entailed from a conditional knowledge base (except for the inference 
from a contradictory context), and conversely these postulates enable the confidence 
relation to be reconstructed. Similar properties have been laid bare in older conditional 
logics by Adams (1975), using infinitesimal probabilities, and, of course, Lewis (1973). 
A rule base also generates a family of comparative possibility relations ("rankings of 
models" after Lehmann and Magidor, 1992; see Dubois and Prade, 1995a). 

If the family of comparative possibility relations reduces to a single one, then the 
plausible relation satisfies the so-called rational monotony property introduced by 
Makinson. This feature is characteristic of comparative possibility relations (Benferhat 
et al., 1997). Plausible inference with a comparative possibility relation meets Shoham 
(1988)'s view of nonmonotonic inference, as classical inference from the most normal 
situations in a given context. Plausible inference under a comparative possibility 
relation can be syntactically managed in possibilistic logic (Dubois et al., 1994; Lang, 
2000). When an acceptance relation corresponds to a family of more than one 
comparative possibility relation, there is a principle of information minimization that 
enables a unique comparative possibility relation in the family to be selected (Dubois 
and Prade, 1998). It is a most cautious choice ensuring a ranking of elementary events 
that is as compact as possible. This selection process is at work in Pearl (1990)’s system 
Z and Lehmann and Magidor (1992)’s "rational closure", as well as in the possibilistic 
handling of default rule bases (Benferhat et al, 1998). Selecting a least informative 
comparative possibility relation in agreement with an acceptance relation is equivalent 
to attaching priorities to rules in a conditional knowledge base, the higher priorities 
being granted to the most specific rules (Pearl, 1990). 
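
To illustrate how priorities can be attached to rules so that the most specific ones rank highest, the following minimal sketch stratifies a small rule base by the usual tolerance test of system Z. The rule base (birds and penguins), the atom names, and the brute-force enumeration of worlds are illustrative assumptions, not part of the original text.

```python
from itertools import product

# Hypothetical propositional atoms, chosen only for illustration.
ATOMS = ("bird", "penguin", "flies")

def worlds():
    """Enumerate every truth assignment over ATOMS as a dict."""
    for values in product((False, True), repeat=len(ATOMS)):
        yield dict(zip(ATOMS, values))

# A default rule "if a then normally b" is a pair (a, b) of predicates on worlds.
RULES = {
    "birds fly": (lambda w: w["bird"], lambda w: w["flies"]),
    "penguins are birds": (lambda w: w["penguin"], lambda w: w["bird"]),
    "penguins do not fly": (lambda w: w["penguin"], lambda w: not w["flies"]),
}

def verifies(w, rule):
    a, b = rule
    return a(w) and b(w)

def falsifies(w, rule):
    a, b = rule
    return a(w) and not b(w)

def tolerated(name, remaining):
    """True if some world verifies the rule and falsifies no rule in `remaining`."""
    return any(
        verifies(w, RULES[name])
        and not any(falsifies(w, RULES[r]) for r in remaining)
        for w in worlds()
    )

def z_partition(rule_names):
    """Split the rules into layers; later layers hold more specific rules."""
    remaining = set(rule_names)
    layers = []
    while remaining:
        layer = {r for r in remaining if tolerated(r, remaining)}
        if not layer:
            raise ValueError("the rule base is inconsistent")
        layers.append(layer)
        remaining -= layer
    return layers

LAYERS = z_partition(RULES)
```

On this toy base the general rule about birds lands in the first layer and the two rules about penguins in the second, so the more specific rules receive the higher priority.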

4 Plausible Inference versus Probabilistic Reasoning 

The originality of the confidence relation approach to plausible inference is that, rather 
than starting from syntactic objects and intuitive postulates (like Lehmann and 
colleagues), our starting points are on the one hand the confidence relation that is 
thought of as a natural tool for describing an agent’s uncertain knowledge, and the 
notion of accepted belief on the other hand. This point of view enables plausible
(non-monotonic) reasoning to be cast in the general framework of uncertain reasoning, which
includes probabilistic reasoning. The analogy between plausible reasoning with 
accepted beliefs and probabilistic reasoning is now patent (see also Paris, 1994). In 
probabilistic reasoning, the confidence relation stems from a probability measure or a 
family thereof. A set of generic rules is then encoded as a set of conditional probabilities 
characterizing a family of probability measures. Currently, the most popular approach in
AI is the one where this family reduces to a single measure, and the set of conditional
probabilities defines a Bayesian network (Pearl, 1988). When the probabilistic 
information is incomplete, the selection of a unique probability measure often relies on 
the principle of maximal entropy. A Bayesian network really represents generic 
knowledge, like any confidence relation. This network derives either from expert 
domain knowledge or from statistical data. The selection of a most cautious comparative 
possibility relation in agreement with an acceptance relation is similar to the selection of 
a unique probability measure using maximal entropy (Paris, 1994). 

Probabilistic inference with a Bayesian network consists in calculating the 
(statistical) conditional probability of a conclusion, where the conditioning event 
encodes the available observations (Pearl, 1988). The obtained conditional probability 
value is interpreted as the degree of belief of the conclusion in the current situation, 
assuming that this situation is a regular one in the context described by the observations. 
This procedure is very similar to the derivation of a plausible conclusion by 
conditioning an acceptance relation, or by deducing a rule from a rule base. The derived 
rule is valid "generally". Its conclusion is considered as an accepted belief in the current 
situation assuming that this situation is not an exceptional one in the context described 
by the observations modeled by the condition part of the derived rule. There is in fact a 
strong similarity between conditional probability and conditional possibility, and an 
ordinal form of Bayes rule exists for possibility theory (Dubois and Prade, 1998). 
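
For comparison with the probabilistic identity P(A ∩ B) = P(B | A) · P(A), the ordinal counterpart is usually written as below. The symbol Π for a possibility measure is notation assumed here, since this section does not introduce it, and the second expression is the customary least specific solution of the first equation. This is a sketch of the standard form, not a quotation from Dubois and Prade (1998).

```latex
\Pi(A \cap B) = \min\bigl(\Pi(B \mid A),\, \Pi(A)\bigr),
\qquad
\Pi(B \mid A) =
\begin{cases}
1 & \text{if } \Pi(A \cap B) = \Pi(A) > 0,\\
\Pi(A \cap B) & \text{otherwise.}
\end{cases}
```

The product of Bayes rule is replaced by the minimum, so only the ordering of possibility degrees matters.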

Of course, there are also noticeable differences between probabilistic reasoning and 
ordinal plausible inference:

i) The latter does not quantify belief; 

ii) Plausible reasoning considers the most plausible situations and neglects others,
while probability theory reasons on average;
iii) Lastly, probabilistic reasoning is not compatible with the notion of accepted
belief. 

Indeed, the conjunction of two highly probable events may fail to be highly probable 
and may even turn out to be very improbable. However, the arbitrary conjunction of 
accepted beliefs is still an accepted belief (this is because we assume that the agent 
considers accepted beliefs as tentatively true). This property of ordinal plausible 
inference has been criticized by several authors (Kyburg 1988, Poole 1991). It means 
that ordinal plausible inference suffers from the so-called "lottery paradox" (one can 
believe that any given player in a one-winner lottery game will lose with arbitrarily high
probability, all the more so as players are numerous, but one cannot believe that all of them 
will lose). 

Yet, an acceptance relation can also be represented by means of a family of standard 
probability relations (Benferhat et al., 1999a, Snow, 1999). The corresponding 
probability measures are very special. They enforce a total ordering of states and are 
such that the probability of a state is always larger than the sum of the probabilities of
less probable states. We call them "big-stepped probabilities". They are in some sense 
the total opposite of uniformly distributed ones (without expressing pure determinism, 
though). Indeed, in any context the most likely elementary event in the sense of a
big-stepped probability occurs much more often than the disjunction of other elementary
events. 
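
The defining condition of a big-stepped probability can be checked directly. The sketch below is a minimal illustration; the function name and the sample distributions are assumptions made for the example.

```python
def is_big_stepped(probs):
    """Check that each state's probability exceeds the sum of the
    probabilities of all strictly less probable states.

    Ties fail the test automatically, so a passing distribution
    also induces a strict total ordering of the states.
    """
    tail = 0.0  # running sum over the less probable states seen so far
    for p in sorted(probs):  # from least to most probable
        if p <= tail:
            return False
        tail += p
    return True

# Hypothetical distributions over four states.
BIG_STEPPED = [0.6, 0.3, 0.07, 0.03]
UNIFORM = [0.25, 0.25, 0.25, 0.25]
```

Here `is_big_stepped(BIG_STEPPED)` holds because 0.6 > 0.4, 0.3 > 0.1 and 0.07 > 0.03, whereas the uniform distribution fails at the first comparison.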

We cannot expect to find natural sample spaces equipped with such kinds of 
empirically observed statistical probability functions. But one may think that for 
phenomena which have significant regularities, without being purely deterministic (like 
birds flying!), there may exist at least one partition of the sample space, the elements of 
which can be ordered via a big-stepped probability, and form a set of conceptually 
meaningful states for the agent. The existence of probabilities in strict agreement with
non-monotonic inference may also resolve the lottery paradox, which has been proposed
as a counterexample to the use of classical deduction on accepted beliefs. Indeed, for
big-stepped probabilities (and only for them), the set of probable beliefs {A : P(A | C) >
0.5} remains consistent and deductively closed for any context C. In the lottery
example, it is implicitly assumed that all players have equal chance of winning. The 
underlying probability is uniform. Hence there is no regularity at all in the lottery game: 
no particular occurrence is typical and randomness prevails. It is thus unlikely that an 
agent can come up with a set of consistent default rules about the lottery game. 
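
The contrast between the two cases can be checked by brute force on a small sample space. The following sketch, under the assumption of four states and two hypothetical distributions, tests whether the set of events with conditional probability above 0.5 is closed under conjunction in every context; exact fractions avoid rounding issues.

```python
from fractions import Fraction
from itertools import combinations

def events(states):
    """All subsets of the state space, as frozensets."""
    for k in range(len(states) + 1):
        for combo in combinations(states, k):
            yield frozenset(combo)

def prob(event, p):
    return sum((p[s] for s in event), Fraction(0))

def accepted(context, p):
    """Events A with P(A | context) > 1/2."""
    pc = prob(context, p)
    return {a for a in events(list(p)) if prob(a & context, p) / pc > Fraction(1, 2)}

def closed_in_every_context(p):
    """True if, for every context of positive probability, the conjunction
    of two accepted events is again accepted."""
    for c in events(list(p)):
        if prob(c, p) == 0:
            continue
        acc = accepted(c, p)
        if any((a & b) not in acc for a in acc for b in acc):
            return False
    return True

# Hypothetical distributions: each big-stepped value exceeds the sum of the smaller ones.
BIG_STEPPED = {"s1": Fraction(8, 15), "s2": Fraction(4, 15),
               "s3": Fraction(2, 15), "s4": Fraction(1, 15)}
UNIFORM = {s: Fraction(1, 4) for s in ("s1", "s2", "s3", "s4")}
```

With the uniform distribution, two events of probability 3/4 can intersect in an event of probability 1/2, which is no longer accepted; this is the lottery situation in miniature. With the big-stepped distribution, an event is accepted in a context exactly when it contains the most probable state of that context, so closure holds.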

On the contrary, plausible reasoning based on acceptance relations models an 
agent’s reasoning in front of phenomena which have very regular features (but where 
exceptional situations may nevertheless occur). We conjecture that domains where a 
body of default knowledge exists can be statistically modeled by big-stepped 
probabilities on a meaningful partition of the sample space. If this conjecture is valid, it 
points out a potential link between non-monotonic reasoning and statistical data, in a 
knowledge discovery perspective. An open problem along this line is as follows: Given 
statistical data on a sample space, find the "best" partition(s) of the sample space, on 
which big-stepped probabilities are induced and meaningful default rules can be 
extracted (see Benferhat et al., 2001b for preliminary results). The difference between
other rule extraction techniques and the one suggested here is that, in our view, the
presence of exceptions is acknowledged in the very definition of symbolic rules, for
which the proportion of such exceptions is not made explicit.
5 Two Kinds of Epistemic Revision

The framework of confidence relations can also account for the AGM revision theory 
after Alchourrón et al. (1985), but it somehow questions the idea of Gärdenfors and
Makinson (1988) that revision and plausible inference are two sides of the same coin.
Indeed, in Gärdenfors' revision theory, the agent only possesses a closed set of
propositions (a belief set) and receives a sure input that can be understood as a new
observation of the (static) world. So, the Gärdenfors revision theory only accounts for
the evolution of the beliefs of an agent who makes a new contingent observation. 

However, this is only one possible kind of epistemic revision. The distinction 
between contingent beliefs and generic knowledge forces us to consider another meaning
of revision: the revision of the generic knowledge, which is not the topic of the AGM 
theory. It consists of modifying the confidence relation upon arrival of new generic 
knowledge, for instance when the agent happens to acquire a new default rule on his 
domain of investigation. Then events that were thought to normally occur are now 
considered less normal. For instance a medical doctor does not modify his medical 
knowledge when he gets new test results for a patient. He just revises his beliefs about 
the patient state. However, a medical doctor may revise his medical knowledge when he 
reads a specialized book or attends a medicine conference. Several authors like Spohn 
(1988), Williams (1994), Boutilier and Goldszmidt (1995), Darwiche and Pearl (1997), 
Dubois and Prade (1997), Benferhat et al. (1999c) have considered tools and principles
for generic knowledge revision, although the distinction between the two types of 
revision is not always so clear from reading these works. Indeed, there is no consensus 
on a general and systematic approach to that kind of epistemic change in the literature, 
and the same can be observed for problems of revision of Bayesian networks (which 
pertain to probability kinematics, see Domotor, 1985). 

The AGM revision theory only assumes that a belief set is replaced by another belief 
set, and it gives minimal rationality constraints relating the prior and the posterior belief 
sets. In doing so, it may wrongly suggest that the posterior belief set can indeed be
derived from the prior one and the input information only. The confidence relation 
framework shows that this is not the case. The calculation of the posterior belief set 
does not use the prior belief set. The posterior belief set is built by means of plausible 
inference from the generic knowledge encoded in the confidence relation conditioned 
on the new context formed by all the available observations, including the new one. This 
is what we called "focusing" in previous publications (Dubois et al., 1998b). 

The representation theorem of the AGM theory actually lays bare the existence of 
an epistemic entrenchment relation (basically the dual of a possibility relation, see 
Dubois and Prade, 1991) and confirms that the construction of the revised belief set can 
be expressed by conditioning this particular confidence relation on the input 
information. This strategy is the same as the one adopted when querying a Bayesian net 
on the basis of new observations. However, in the AGM theory, the epistemic 
entrenchment looks like a technical by-product of the formal construction, while we 
claim that this is the primitive object, and that all the belief sets are derived from it in 
every context. Concerning the iteration of contingent belief revision, suppose two inputs 
are obtained in a row. Note that since the inputs are considered as sure observations 
about a static world, they cannot be contradictory. When the second input arrives, a 
sound strategy is, in the AGM setting, to revise the original belief set (not the one 
revised by the first input) by the conjunction of the two inputs. In practice, the case
when two observations are inconsistent suggests that, either one of them is wrong, or 
they do not pertain to the same case. 
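
The focusing strategy described above can be sketched as follows: contingent beliefs are recomputed from a fixed ranking of worlds, conditioned on the conjunction of all observations received so far, and the prior belief set is never consulted. The ranking function and the atoms below are hypothetical and serve only as an example of generic knowledge.

```python
from itertools import product

# Hypothetical atoms and worlds, for illustration only.
ATOMS = ("bird", "penguin", "flies")
WORLDS = [dict(zip(ATOMS, v)) for v in product((False, True), repeat=len(ATOMS))]

def rank(w):
    """Assumed generic knowledge: 0 = normal, higher = more exceptional."""
    if w["penguin"] and (w["flies"] or not w["bird"]):
        return 2
    if w["bird"] and not w["flies"]:
        return 1
    return 0

def plausible(observations, conclusion):
    """True if the conclusion holds in all the most normal worlds
    that satisfy every observation (the current context)."""
    context = [w for w in WORLDS if all(obs(w) for obs in observations)]
    if not context:
        raise ValueError("the observations are contradictory")
    best = min(rank(w) for w in context)
    return all(conclusion(w) for w in context if rank(w) == best)

def is_bird(w):
    return w["bird"]

def is_penguin(w):
    return w["penguin"]

def flies(w):
    return w["flies"]

def does_not_fly(w):
    return not w["flies"]

# First input: the ranking is focused on "bird".
after_first = plausible([is_bird], flies)
# Second input: the same ranking is focused on the conjunction of both inputs.
after_second = plausible([is_bird, is_penguin], does_not_fly)
```

Both calls use the same `rank` function; only the context changes, which mirrors the claim that the generic knowledge is left untouched when a new contingent observation arrives.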

Some people have claimed that in order to iterate AGM belief revision, one needs to 
construct not only the new belief set, but also a new epistemic entrenchment relation. In 
the scope of the revision of contingent beliefs induced by generic knowledge, this is 
questionable. On the contrary, the same epistemic entrenchment should remain across 
successive revisions of contingent beliefs caused by new contingent observations. 
Similarly, in probabilistic reasoning (Pearl, 1988), the same Bayesian network is used 
when new observations come in. If the epistemic entrenchment must be revised, it 
means that the input information is a piece of generic knowledge, and such a kind of 
revision is not the purpose of the AGM theory. 

6 Conclusion 

To sum up, the framework of confidence relations provides a unified view of
non-monotonic and probabilistic reasoning. It also points out the distinction between the
revision of contingent beliefs (by focusing the confidence relations on the proper 
context formed by the observations) and the revision of the confidence relation itself. 
This distinction is made clear by considering that the information possessed by an agent 
is made of three items: sure observations, generic knowledge and inferred contingent 
beliefs. From a computational point of view, plausible inference from a confidence 
relation can be achieved using a standard theorem-prover in propositional logic, and 
comes down to a sequence of consistency tests. When the confidence relation takes the 
form of a unique possibility relation, like in system Z and the like, the problem can be 
encoded in possibilistic logic, which handles prioritized propositional bases, with a 
complexity of SAT × log₂ n if there are n priority levels (Lang, 2001).
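
The logarithmic factor can be understood as a binary search over the priority levels, each step being one consistency test. The sketch below is an illustration under stated assumptions: the prioritized base, the level numbering (higher means more certain), and the brute-force `consistent` function standing in for a SAT solver are all hypothetical.

```python
from itertools import product

# Hypothetical atoms, for illustration only.
ATOMS = ("bird", "penguin", "flies")

def consistent(formulas):
    """Brute-force stand-in for one call to a SAT solver."""
    return any(
        all(f(dict(zip(ATOMS, v))) for f in formulas)
        for v in product((False, True), repeat=len(ATOMS))
    )

def inconsistency_level(base, n_levels):
    """Return (k, calls), where k is the largest level such that the formulas
    of level >= k are jointly inconsistent (0 if the whole base is consistent)
    and calls is the number of consistency tests performed."""
    calls = 0
    lo, hi = 0, n_levels  # invariant: the answer lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi + 1) // 2
        calls += 1
        cut = [f for f, level in base if level >= mid]
        if consistent(cut):
            hi = mid - 1  # every higher cut is a subset, hence consistent too
        else:
            lo = mid      # every lower cut is a superset, hence inconsistent too
    return lo, calls

# Hypothetical prioritized base: (formula, level) pairs with three levels.
BASE = [
    (lambda w: w["penguin"], 3),                          # observation
    (lambda w: not w["penguin"] or w["bird"], 2),         # penguins are birds
    (lambda w: not w["penguin"] or not w["flies"], 2),    # penguins do not fly
    (lambda w: not w["bird"] or w["flies"], 1),           # birds fly
]

RESULT = inconsistency_level(BASE, 3)
```

On this base the search needs two consistency tests for three levels and finds that only the lowest-priority formula has to be given up. The search is sound because the cuts are nested, so inconsistency is monotone in the level.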

Future lines of research in the ordinal approach to plausible reasoning include the 
modeling of independence (Dubois et al. 1997, Ben Amor et al. 2000) and the study of 
graphical models that would be the qualitative counterpart of Bayesian networks 
(Benferhat et al. 1999b). Some results indicate that possibilistic logic bases, conditional 
knowledge bases and possibilistic nets have the same expressive power (Benferhat et 
al., 2001a). However, it is not clear which is the most natural framework for knowledge
elicitation, and for practical computation. Lastly, by bridging the gap between 
probability and non-monotonic reasoning, the confidence relation approach paves the 
way to the data-driven learning of default rules. 